Preface



The 7th International Colloquium on Grammatical Inference (ICGI 2004) was 
held at the National Centre for Scientific Research “Demokritos”, Athens, Greece
on October 11-13, 2004. ICGI 2004 was the seventh in a series of successful 
biennial international conferences in the area of grammatical inference. Previous 
meetings were held in Essex, UK; Alicante, Spain; Montpellier, France; Ames, 
Iowa, USA; Lisbon, Portugal; and Amsterdam, The Netherlands. This series 
of conferences seeks to provide a forum for the presentation and discussion of 
original research papers on all aspects of grammatical inference. 

Grammatical inference, the study of learning grammars from data, is an es- 
tablished research field in artificial intelligence, dating back to the 1960s, and has 
been extensively addressed by researchers in automata theory, language acquisi- 
tion, computational linguistics, machine learning, pattern recognition, computa- 
tional learning theory and neural networks. ICGI 2004 emphasized the multidis- 
ciplinary nature of the research field and the diverse domains in which grammat- 
ical inference is being applied, such as natural language acquisition, computa- 
tional biology, structural pattern recognition, information retrieval, Web mining, 
text processing, data compression and adaptive intelligent agents. 

We received 45 high-quality papers from 19 countries. The papers were re- 
viewed by at least two - in most cases three - reviewers. In addition to the 20 
full papers, 8 short papers that received positive comments from the reviewers 
were accepted, and they appear in a separate section of this volume. The top- 
ics of the accepted papers vary from theoretical results of learning algorithms 
to innovative applications of grammatical inference, and from learning several 
interesting classes of formal grammars to estimations of probabilistic grammars. 

In conjunction with ICGI 2004, a context-free grammar learning competition, 
named Omphalos, took place. In an invited paper in this volume, the organiz- 
ers of the competition report on the peculiarities of such an endeavor and some 
interesting theoretical findings. Last but not least, we are honored by the contri- 
butions of our invited speakers Prof. Dana Angluin, from Yale University, USA, 
and Prof. Enrique Vidal, from Universidad Politécnica de Valencia, Spain.

The editors would like to acknowledge the contribution of the Program Com- 
mittee and the Additional Reviewers in reviewing the submitted papers, and 
thank the Organizing Committee for their invaluable help in organizing the 
conference. In particular, we would like to thank Colin de la Higuera, Menno
van Zaanen, Georgios Petasis, Georgios Sigletos and Evangelia Alexopoulou
for their additional voluntary service to the grammatical inference community, 
through this conference. We would also like to acknowledge the use of the
CyberChair software, from Borbala Online Conference Services, in the submission
and reviewing process. Finally, we are grateful for the generous support and 
sponsorship of the conference by NCSR “Demokritos”, the PASCAL and KD- 
net European Networks of Excellence, and Biovista: Corporate Intelligence in 
Biotechnology. 
Learning and Mathematics



Dana Angluin 



Yale University 

P.O. Box 208285, New Haven, CT 06520-8285, USA 

angluin@cs.yale.edu

http://www.cs.yale.edu/people/faculty/angluin.html



Our formal models of learning seem to overestimate how hard it is to learn some 
kinds of things, including grammars. One possible reason for this is that our 
models generally do not represent learning a concept as an incremental addition 
to a rich collection of related concepts. This raises the question of how to make 
a good model of a “rich collection of related concepts.” Rather than start by 
trying to make a general model, or adapting existing formalisms (e.g., logical 
theories) for the purpose, I have undertaken an extended look at a particular 
domain, namely mathematics. Mathematics certainly qualifies as a rich collection 
of related concepts, and has the advantage of thousands of years of effort devoted 
to improving its representations and clarifying its interconnections. This talk will 
present some of the issues I have encountered, and will probably consist of more 
questions than answers. 

An anecdote will begin to raise some questions. At a workshop some years 
ago, a colleague asked me if I was familiar with the following problem. Given a 
nonempty finite set U of cardinality n, and two positive integers s < t < n, find 
the minimum cardinality of a collection C of subsets of U of size t such that 
every subset of U of size s is a subset of some element of C. Since I was not 
familiar with the problem, she continued to ask others at the workshop, until 
finally someone gave her the name of the problem and a pointer to work on it. 
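
For very small instances, the problem can be explored by exhaustive search. The following is only an illustrative sketch: the function name `min_cover_size` and the brute-force strategy are assumptions made here, and the search is exponential, so it is usable only for tiny values of n.

```python
from itertools import combinations


def min_cover_size(n: int, t: int, s: int) -> int:
    """Smallest number of t-subsets of {0..n-1} such that every
    s-subset is contained in at least one of them (brute force)."""
    blocks = [set(b) for b in combinations(range(n), t)]
    targets = [set(c) for c in combinations(range(n), s)]
    # Try families of increasing size; the first one that covers
    # every s-subset gives the minimum cardinality.
    for size in range(1, len(blocks) + 1):
        for family in combinations(blocks, size):
            if all(any(target <= b for b in family) for target in targets):
                return size
    raise ValueError("no cover exists")  # not reached when s < t < n


if __name__ == "__main__":
    print(min_cover_size(4, 3, 2))
    print(min_cover_size(5, 3, 2))
```
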

The meaning of the problem is clear (to someone with some mathematical 
training) from a very short description. What kind of representation would it 
take for us to be able to give something like this description to a search engine 
and be referred to papers that dealt with it? We already are expected to make our 
papers available in machine readable form on the web, or risk their irrelevance. 
Perhaps some enhancement of that representation could make such searches 
possible? 

As another example, students in an elementary discrete mathematics course 
are often introduced to the concepts of permutations and combinations by means 
of concrete examples. Liu [1] asks the reader to imagine placing three balls, 
colored red, blue, and white, into ten boxes, numbered 1 through 10, in such 
a way that each box holds at most one ball. The problem is to determine the 
number of ways that this may be done. Lovasz, Pelikan and Vesztergombi [2] 
describe a party with seven participants, each of whom shakes hands once with 
each of the others, and ask how many handshakes there have been in total. An 
introductory textbook will typically contain many examples and exercises of this 
kind. 
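
Both textbook examples can be checked by direct enumeration. The sketch below is an aside, not part of either textbook's presentation; the function names are hypothetical, and the closed-form values from `math.perm` and `math.comb` are printed next to the enumerated counts for comparison.

```python
from itertools import combinations, permutations
from math import comb, perm


def ball_placements(n_boxes: int = 10, n_balls: int = 3) -> int:
    """Count placements of distinguishable balls into numbered boxes,
    with at most one ball per box."""
    # A placement is an ordered choice of distinct boxes, one per ball.
    return sum(1 for _ in permutations(range(n_boxes), n_balls))


def handshakes(n_people: int = 7) -> int:
    """Count unordered pairs of participants (one handshake per pair)."""
    return sum(1 for _ in combinations(range(n_people), 2))


if __name__ == "__main__":
    print(ball_placements(), perm(10, 3))  # 720 720
    print(handshakes(), comb(7, 2))        # 21 21
```
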
The situations used involve familiar elements, are easily imagined, and are
intended to engage the student’s intuitions in helpful ways. However, some stu- 
dents find it quite difficult to get the hang of the implicit rules for these problems. 
What will not help such students is the customary explicit and detailed formal- 
ization of the domain as a logical theory. What might help would be a somewhat 
more concrete model in terms of actions and state spaces. This is reminiscent 
of Piaget’s emphasis upon an individual’s actions as a basis for more abstract 
understanding. 

These issues provide a window on other questions about mathematical rea- 
soning and representation. It is likely that we will make more and more use of 
computers to help us create and use mathematics. Questions of how best to do 
that are far from settled, and will require a deep understanding of the multi- 
tude of ways that people actually do mathematics. Ironically, those for whom 
mathematics is difficult may provide some of the clearest evidence of what is 
involved. 



References 

1. C. L. Liu. Elements of Discrete Mathematics. McGraw-Hill, 1977. 

2. L. Lovasz, J. Pelikan, and K. Vesztergombi. Discrete Mathematics: Elementary and 
Beyond. Springer, 2003. 
Abstract. In formal language theory, finite-state transducers are well-
known models for “input-output” rational mappings between two lan-
guages. Even if more powerful, recursive models can be used to account
for more complex mappings, it has been argued that the input-output 
relations underlying most usual natural language pairs are essentially ra- 
tional. Moreover, the relative simplicity of these mappings has recently 
led to the development of techniques for learning finite-state transduc-
ers from a training set of input-output sentence pairs of the languages
considered. Following these arguments, in the last few years a number 
of machine translation systems have been developed based on stochas- 
tic finite-state transducers. Here we review the statistical statement of 
Machine Translation and how the corresponding modelling, learning and 
search problems can be solved by using stochastic finite-state transduc- 
ers. We also review the results achieved by the systems developed under 
this paradigm. After presenting the traditional approach, where trans- 
ducer learning is mainly solved under the grammatical inference frame- 
work, we propose a new approach where learning is explicitly considered 
as a statistical estimation problem and the whole stochastic finite-state 
transducer learning problem is solved by expectation maximisation. 



1 Introduction 

Machine translation (MT) is one of the most appealing (and challenging) ap-
plications of human language processing technology. Because of its great social
and economic interest, in the last 20 years MT has been considered from al-
most every imaginable point of view: from strictly linguistics-based methods to
pure statistical approaches including, of course, formal language theory and the
corresponding learning paradigm, grammatical inference (GI). Different degrees
of success have been achieved so far using these approaches. 

Basic MT consists in transforming text from a source language into a target 
language, but several extensions to this framework have been considered. Among 
the most interesting of these extensions are speech-to-speech MT (STSMT) and 

* This work was partially supported by the European Union project TT2 (IST-2001- 
32091) and by the Spanish project TEFATE (TIC 2003-08681-C02-02). 
computer assisted (human) translation (CAT). In STSMT, which is generally
considered significantly harder than pure text MT, the system has to accept 
a source-language utterance and produce corresponding human-understandable 
target-language speech. In CAT, on the other hand, the input is source-language 
text and both the system and the human translator have to collaborate with each 
other in an attempt to produce high quality target text. 

Here we consider MT, STSMT and CAT models that can be automat-
ically learned through suitable combinations of GI and statistical methods. In
particular we are interested in stochastic finite-state transducers. Techniques for
learning these models have been studied by several authors, in many cases with
special motivation for their use in MT applications [1-12].

2 General Statement of MT Problems 

The (text-to-text) MT problem can be statistically stated as follows. Given a
sentence s from a source language, search for a target-language sentence t which
maximises the posterior probability¹:

$$\hat{t} = \operatorname*{argmax}_{t} \Pr(t \mid s) . \tag{1}$$

It is commonly accepted that a convenient way to deal with this equation is
to transform it by using Bayes’ theorem:

$$\hat{t} = \operatorname*{argmax}_{t} \Pr(t) \cdot \Pr(s \mid t) , \tag{2}$$

where Pr(t) is a target language model (which gives high probability to well
formed target sentences) and Pr(s | t) accounts for source-target word(-position)
relations and is based on stochastic dictionaries and alignment models [13,14].

Alternatively, the conditional distribution in Eq. 1 can be transformed into a
joint distribution:

$$\hat{t} = \operatorname*{argmax}_{t} \Pr(s, t) , \tag{3}$$

which can be adequately modelled by means of stochastic finite-state transducers
(SFST) [15]. This is the kind of model considered in the present work.
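
The equivalence of Eqs. 1 to 3 for a fixed source sentence can be checked numerically. The sketch below uses an invented toy joint distribution (the sentences and probabilities are illustrative assumptions, not data from any of the corpora discussed here) and shows that the three decision rules select the same target sentence.

```python
# Toy joint distribution Pr(s, t); the values are invented and sum to 1.
joint = {
    ("una habitación", "a room"): 0.30,
    ("una habitación", "one room"): 0.15,
    ("una habitación", "a key"): 0.05,
    ("la llave", "the key"): 0.35,
    ("la llave", "a key"): 0.10,
    ("la llave", "a room"): 0.05,
}
targets = sorted({t for _, t in joint})


def pr_source(s):
    """Marginal Pr(s)."""
    return sum(p for (s2, _), p in joint.items() if s2 == s)


def pr_target(t):
    """Marginal Pr(t), playing the role of the target language model."""
    return sum(p for (_, t2), p in joint.items() if t2 == t)


def by_posterior(s):
    """Eq. 1: maximise Pr(t | s)."""
    return max(targets, key=lambda t: joint.get((s, t), 0.0) / pr_source(s))


def by_bayes(s):
    """Eq. 2: maximise Pr(t) * Pr(s | t)."""
    return max(
        targets,
        key=lambda t: pr_target(t) * (joint.get((s, t), 0.0) / pr_target(t)),
    )


def by_joint(s):
    """Eq. 3: maximise Pr(s, t)."""
    return max(targets, key=lambda t: joint.get((s, t), 0.0))


if __name__ == "__main__":
    for s in ("una habitación", "la llave"):
        print(s, "->", by_posterior(s), "|", by_bayes(s), "|", by_joint(s))
```

The three rules agree because Pr(s) is a constant factor once s is fixed; they differ only in which distributions have to be modelled.
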

Let us now consider the STSMT problem. Here an acoustic representation
of a source-language utterance x is available and the problem is to search for a
target-language sentence t that maximises the posterior probability²:

$$\hat{t} = \operatorname*{argmax}_{t} \Pr(t \mid x) . \tag{4}$$

Every possible decoding of a source utterance x in the source language can be 
considered as the value of a hidden variable s [15] and, assuming Pr(x|s,t) does 
not depend on t, Eq. 4 can be rewritten as: 

¹ For simplicity, Pr(X = x) and Pr(X = x | Y = y) are denoted as Pr(x) and Pr(x | y).

² From t̂, a target utterance can be produced by using a text-to-speech synthesiser.
$$\hat{t} = \operatorname*{argmax}_{t} \sum_{s} \Pr(s, t) \cdot \Pr(x \mid s) . \tag{5}$$
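
The step from Eq. 4 to Eq. 5 can be spelled out as follows. This is a sketch that uses only the independence assumption stated above; Pr(x) is dropped in the first line because it does not depend on t.

```latex
\begin{align*}
\hat{t} &= \operatorname*{argmax}_{t} \Pr(t \mid x)
         = \operatorname*{argmax}_{t} \Pr(x, t) \\
        &= \operatorname*{argmax}_{t} \sum_{s} \Pr(s, t) \cdot \Pr(x \mid s, t) \\
        &= \operatorname*{argmax}_{t} \sum_{s} \Pr(s, t) \cdot \Pr(x \mid s) .
\end{align*}
```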

As in plain MT, Pr(s,t) can be modelled by a SFST. The term Pr(x|s), on the 
other hand, can be modelled through hidden Markov models (HMM) [16], which 
are the standard acoustic models in automatic speech recognition. Thanks to the 
homogeneous finite-state nature of both SFSTs and HMMs, and approximating
the sum with a maximisation, Eq. 5 can be easily and efficiently solved by the 
well-known Viterbi algorithm [15]. 

Finally, let us consider a simple statement of CAT [17]. Given a source text s
and a fixed prefix $t_p$ of the target sentence (previously validated by the human
translator), the problem is to search for a suffix $t_s$ of the target sentence that
maximises the posterior probability:

$$\hat{t}_s = \operatorname*{argmax}_{t_s} \Pr(t_s \mid s, t_p) . \tag{6}$$

Taking into account that $\Pr(t_p \mid s)$ does not depend on $t_s$, we can write:

$$\hat{t}_s = \operatorname*{argmax}_{t_s} \Pr(s, t_p t_s) , \tag{7}$$

where $t_p t_s$ is the concatenation of the given prefix $t_p$ and a suffix $t_s$ suggested
by the system. Eq. 7 is similar to Eq. 3, but here the maximisation is constrained
to a set of suffixes, rather than full sentences. As in Eq. 3, this joint distribution
can be adequately modelled by means of SFSTs [18].
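
The prefix-constrained search of Eq. 7 can be illustrated on a toy table of joint probabilities. In this sketch the joint distribution is given explicitly as a dictionary (an assumption made for illustration; in the systems discussed here it would be given by a SFST), and `best_suffix` is a hypothetical helper name.

```python
def best_suffix(joint, source, prefix):
    """Return the suffix t_s maximising Pr(s, t_p t_s) among the target
    sentences listed in `joint`, or None if no target extends the prefix."""
    n = len(prefix)
    scores = {
        target[n:]: p
        for (s, target), p in joint.items()
        if s == source and target[:n] == prefix
    }
    return max(scores, key=scores.get) if scores else None


# Invented example: three candidate translations of one source sentence.
source = ("una", "habitación", "doble")
joint = {
    (source, ("a", "double", "room")): 0.5,
    (source, ("a", "room", "with", "two", "beds")): 0.2,
    (source, ("one", "double", "room")): 0.3,
}

if __name__ == "__main__":
    print(best_suffix(joint, source, ()))             # ('a', 'double', 'room')
    print(best_suffix(joint, source, ("a", "room")))  # ('with', 'two', 'beds')
    print(best_suffix(joint, source, ("one",)))       # ('double', 'room')
```

With an empty prefix the search reduces to Eq. 3; each validated word narrows the set of admissible completions.
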

All the above problem statements share the common learning problem of 
estimating Pr(s, t), which can be approached by training a SFST from a parallel 
text corpus. 

3 Stochastic Finite-State Transducers 

Different types of SFSTs have been applied with success in some areas of machine 
translation and other areas of natural language processing [3,19,4,8,11,9,20]. 
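Here only conventional and subsequential SFSTs are considered. A SFST $\mathcal{T}_P$ is
a tuple $\langle \Sigma, \Delta, Q, q_0, p_{\mathcal{T}}, f_{\mathcal{T}} \rangle$, where $\Sigma$ is a finite set of source words, $\Delta$ is a finite
set of target words, $Q$ is a finite set of states, $q_0$ is the initial state and $p_{\mathcal{T}}$ and
$f_{\mathcal{T}}$ are two functions $p_{\mathcal{T}} : Q \times \Sigma \times \Delta^* \times Q \to [0, 1]$ (transition probabilities) and
$f_{\mathcal{T}} : Q \to [0, 1]$ (final-state probability), that verify:

$$\forall q \in Q, \quad f_{\mathcal{T}}(q) + \sum_{(a, \omega, q') \in \Sigma \times \Delta^* \times Q} p_{\mathcal{T}}(q, a, \omega, q') = 1 .$$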

Given $\mathcal{T}_P$, the joint probability of a pair $(s, t) \in \Sigma^* \times \Delta^*$, denoted as
$\Pr_{\mathcal{T}_P}(s, t)$, is the sum of the probabilities of all sequences of states that deal
with (s, t); that is, the concatenation of the source (target) words of the transi-
tions between each pair of adjacent states in the sequence of states is the source
sentence s (target sentence t) [21]. The probability of a particular state sequence
is the product of the corresponding transition probabilities, times the final-state
probability of the last state in the sequence [21].

If $\mathcal{T}_P$ has no useless states, $\Pr_{\mathcal{T}_P}(s, t)$ describes a probability distribution on
$\Sigma^* \times \Delta^*$ which is called a stochastic regular translation. This distribution is used
to model the joint probabilities introduced in Eq. 3 in the previous section.
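
The definition above can be made concrete with a small data structure. The following is a minimal sketch, not the implementation used in the cited systems: the class name `SFST`, its layout and the toy transducer are assumptions. It checks the normalisation condition and computes the joint probability of a pair by summing over all state sequences with a forward pass.

```python
from collections import defaultdict
from math import isclose


class SFST:
    def __init__(self, initial, transitions, final):
        self.initial = initial
        # state -> list of (source_word, target_words_tuple, next_state, prob)
        self.transitions = transitions
        # state -> final-state probability
        self.final = final

    def is_normalised(self):
        """Check f(q) + sum of outgoing transition probabilities = 1
        for every state listed in `transitions` or `final`."""
        states = set(self.transitions) | set(self.final)
        return all(
            isclose(
                self.final.get(q, 0.0)
                + sum(p for *_, p in self.transitions.get(q, [])),
                1.0,
            )
            for q in states
        )

    def joint_probability(self, source, target):
        """Sum the probabilities of all state sequences dealing with (s, t)."""
        source, target = tuple(source), tuple(target)
        # forward[(q, j)]: mass of paths that have read the source words
        # processed so far, emitted target[:j] and reached state q.
        forward = {(self.initial, 0): 1.0}
        for word in source:
            nxt = defaultdict(float)
            for (q, j), mass in forward.items():
                for a, out, q2, p in self.transitions.get(q, []):
                    if a == word and target[j:j + len(out)] == out:
                        nxt[(q2, j + len(out))] += mass * p
            forward = nxt
        return sum(
            mass * self.final.get(q, 0.0)
            for (q, j), mass in forward.items()
            if j == len(target)
        )


# Invented toy transducer: two different paths produce "a room".
toy = SFST(
    initial=0,
    transitions={
        0: [("una", ("a",), 1, 0.5), ("una", (), 3, 0.1), ("una", ("one",), 1, 0.4)],
        1: [("habitación", ("room",), 2, 1.0)],
        3: [("habitación", ("a", "room"), 2, 1.0)],
    },
    final={2: 1.0},
)

if __name__ == "__main__":
    print(toy.is_normalised())
    print(toy.joint_probability(["una", "habitación"], ["a", "room"]))
    print(toy.joint_probability(["una", "habitación"], ["one", "room"]))
```
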

Given a SFST $\mathcal{T}_P$, a non-probabilistic counterpart $\mathcal{T}$, called the characteristic
finite-state transducer (FST) of $\mathcal{T}_P$, can be defined. The transitions are those
tuples in $Q \times \Sigma \times \Delta^* \times Q$ with probability greater than zero and the final
states are those states in $Q$ with final-state probability greater than zero.

A particularly interesting transducer model is the subsequential transducer 
(SST) which is a finite-state transducer with the basic restriction of being de- 
terministic. This implies that if two transitions have the same starting state and 
the same source word, then both ending states are the same state and both the 
target strings are also the same target string. In addition, SSTs can produce a 
target substring when the end of the input string has been detected. 

Any SFST has two embedded stochastic regular languages, one for the source 
alphabet and another for the target alphabet. These languages correspond to the 
two marginals of the joint distribution modelled by the SFST [21]. 

SFSTs exhibit properties and problems similar to those exhibited by stochas- 
tic regular languages. One of these properties is the formal basis of the GIATI 
technique for transducer inference presented in Section 6. It can be stated as the 
following theorem [21]: Each stochastic regular translation can be obtained from 
a stochastic local language and two alphabetic morphisms. The morphisms allow 
for building the components of a pair in the regular translation from a string in 
the local language [21]. 

On the other hand, one of the problems of SFSTs is the stochastic translation
problem [22]: given a SFST $\mathcal{T}_P$ and $s \in \Sigma^*$, search for a target string t that
maximises $\Pr_{\mathcal{T}_P}(s, t)$. While this has been proved to be an NP-hard problem [22], a
generally good approximation can be obtained in polynomial time through a
simple extension to the Viterbi algorithm [10].
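
A Viterbi-style approximation replaces the sum over state sequences by the single most probable sequence and outputs the target string emitted along it. The sketch below is one plausible reading of such an extension, not necessarily the algorithm of [10]; the function name and the toy transducer are assumptions.

```python
def viterbi_translate(initial, transitions, final, source):
    """Return (probability, target words) of the most probable single path
    that reads `source`, or None if no accepting path exists."""
    # best[q] = (probability of the best path reaching q, words emitted on it)
    best = {initial: (1.0, ())}
    for word in source:
        nxt = {}
        for q, (score, out) in best.items():
            for a, emitted, q2, p in transitions.get(q, []):
                if a != word:
                    continue
                cand = (score * p, out + emitted)
                if q2 not in nxt or cand[0] > nxt[q2][0]:
                    nxt[q2] = cand
        best = nxt
    finished = [
        (score * final[q], out)
        for q, (score, out) in best.items()
        if final.get(q, 0.0) > 0.0
    ]
    return max(finished, key=lambda c: c[0]) if finished else None


# Invented toy transducer: "a room" has total probability 0.6, split over
# two paths of 0.3 each, while "one room" has a single path of 0.4.
transitions = {
    0: [("una", ("a",), 1, 0.3), ("una", (), 3, 0.3), ("una", ("one",), 1, 0.4)],
    1: [("habitación", ("room",), 2, 1.0)],
    3: [("habitación", ("a", "room"), 2, 1.0)],
}
final = {2: 1.0}

if __name__ == "__main__":
    print(viterbi_translate(0, transitions, final, ["una", "habitación"]))
```

On this toy example the best single path yields "one room" although "a room" has the larger total probability, which is why the procedure is only an approximation to the stochastic translation problem.
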

4 Learning Stochastic Finite-State Transducers 

Following the statistical framework adopted in the previous sections, three main 
families of techniques can be used to learn a SFST from a parallel corpus of 
source-target sentences: 

— Traditional syntactic pattern recognition paradigm: a) Learn the SFST 
“topology” (the characteristic transducer) and b) Estimate its probabilities
from the same data. 

— Hybrid methods: Under the traditional paradigm, use statistical methods to 
guide the structure learning. 

— Pure statistical approach: a) Adequately parametrise the SFST structure and 
consider it as a hidden variable and b) estimate everything by expectation 
maximisation (EM). 

To estimate the probabilities in the traditional approach, maximum likelihood 
or other possible criteria can be used [10]. As in every estimation problem, 
an important issue is the modelling of unseen events. In our case, a general
approach to this smoothing problem consists of using stochastic error-correcting
parsing [23,24]. Alternatively, it can be tackled as in language modelling, either
in the estimation of the probabilities of the SFST themselves [25] or within the
process of learning both the structural and probabilistic SFST components [20].

5 Traditional Syntactic Pattern Recognition Paradigm: 
OSTIA 

The formal model of translation used in this section is the SST. A transducer 
of this kind can delay the production of target words until enough source words 
have been seen to determine the correct target words. Therefore, the states of a
SST hold the “memory” of the part of the sentence seen so far. This allows the 
whole context of a word to be taken into account, if necessary, for the translation 
of the next word. A very efficient technique for automatically learning these 
models from a training set of sentence pairs is the so called onward subsequential 
transducer inference algorithm (OSTIA) [3]. 

OSTIA starts by building an initial representation of the original paired
data in the form of a tree (the onward prefix tree transducer). Then appropriate
states of this transducer are merged to build a FST in which those transitions
sharing some structural properties are merged together in order to generalise the
seen samples. To this end, the tree is traversed level by level and the states
in each level are considered to be merged with those previously visited. Only
those pairs of states which are compatible according to the output strings of
their subtrees are effectively merged. If the training pairs were produced by an
unknown SST, T (which can be considered true, at least for many common
pairs of natural languages), and the amount of training data is sufficiently large
and/or representative, then OSTIA is guaranteed to converge to a canonical
(onward) SST which generates the same translation pairs as T [3].
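
The first phase, building the onward prefix tree transducer, can be sketched as follows. This is an illustrative simplification and not the OSTIA code of [3]: it covers only the initial tree (no state merging), it assumes that each source sentence occurs with a single target sentence, and all names are hypothetical. States are source prefixes; each edge emits the part of the longest common output prefix that becomes known when its source word is read.

```python
def lcp(seqs):
    """Longest common prefix of a non-empty list of tuples."""
    prefix = seqs[0]
    for seq in seqs[1:]:
        n = 0
        while n < min(len(prefix), len(seq)) and prefix[n] == seq[n]:
            n += 1
        prefix = prefix[:n]
    return prefix


def onward_prefix_tree(pairs):
    """pairs: list of (source_words, target_words) tuples.
    Returns (root_output, edges, finals)."""
    pairs = list(pairs)
    reach = {}  # source prefix -> targets of all pairs passing through it
    for src, tgt in pairs:
        for i in range(len(src) + 1):
            reach.setdefault(src[:i], []).append(tgt)
    # Output that is already determined once the prefix has been read.
    known = {u: lcp(ts) for u, ts in reach.items()}
    edges = {}  # (state, source word) -> (emitted target words, next state)
    for u in reach:
        if u:  # every non-root state has exactly one incoming edge
            parent = u[:-1]
            edges[(parent, u[-1])] = (known[u][len(known[parent]):], u)
    # Output still pending when a source sentence ends in state src.
    finals = {src: tgt[len(known[src]):] for src, tgt in pairs}
    return known[()], edges, finals


if __name__ == "__main__":
    sample = [
        (("a",), ("x",)),
        (("a", "b"), ("x", "y")),
        (("c",), ("z",)),
    ]
    root_output, edges, finals = onward_prefix_tree(sample)
    print(root_output)
    for key, value in sorted(edges.items()):
        print(key, "->", value)
    print(finals)
```
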

The state merging process followed by OSTIA tries to generalise the training 
pairs as much as possible. This often leads to very compact transducers which 
adequately translate correct source text of the learned task into the target lan-
guage. However, this compactness often entails an excessive generalisation of
the source and the target languages, allowing meaningless source sentences to
be accepted and, even worse, meaningless target sentences to be produced. This is not a prob-
lem for perfectly correct source text, but becomes important when not exactly
correct text or speech is to be used as input.

A possible way to overcome this problem consists in further restricting state 
merging so that the resulting SST only accepts source sentences and produces 
target sentences that are consistent with given source and target (regular) lan- 
guage models. These models are known as domain and range language models, 
respectively. A version of OSTIA called OSTIA-DR [26,4] enforces these restric-
tions in the learning process. N-grams [16], generally trained from the source
and target sentences in the given corpus, are usually adopted as domain and
range language models for OSTIA-DR.
OSTIA and OSTIA-DR have been applied to many relatively simple MT
tasks, including speech-input MT. The first works, reported in [27], were carried 
out on the so called MLA and EuTrans-0 tasks. 

MLA ( miniature language acquisition) was a small task, with vocabularies 
of about 30 words, involving the Spanish-English translation of sentences used 
to describe and manipulate simple visual scenes. OSTIA-DR achieved almost
perfect text-input results on this simple task by training with a large number
of pairs (50,000). Speech-input experiments were also carried out, leading to very
good performance: less than 3% translation word error rate (TWER)³.
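
TWER is based on the minimum number of word insertions, substitutions and deletions needed to turn the system output into a single reference translation. The sketch below computes that word-level edit distance; expressing the result as a percentage of the number of reference words is an assumption made here, since the normalisation is not spelled out in the text, and the function names are hypothetical.

```python
def word_edit_distance(hypothesis: str, reference: str) -> int:
    """Minimum number of word insertions, substitutions and deletions
    needed to turn `hypothesis` into `reference`."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))  # distances for the empty hypothesis
    for i, h in enumerate(hyp, 1):
        cur = [i]
        for j, r in enumerate(ref, 1):
            cur.append(min(
                prev[j] + 1,             # delete h
                cur[j - 1] + 1,          # insert r
                prev[j - 1] + (h != r),  # substitute (free if words match)
            ))
        prev = cur
    return prev[-1]


def twer_percent(hypotheses, references) -> float:
    """Total edits as a percentage of reference words (assumed normalisation)."""
    edits = sum(word_edit_distance(h, r) for h, r in zip(hypotheses, references))
    words = sum(len(r.split()) for r in references)
    return 100.0 * edits / words


if __name__ == "__main__":
    print(word_edit_distance("a room", "one double room"))
    print(twer_percent(["a room", "the key"], ["one double room", "the key"]))
```
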

EuTrans-0, called the traveller task in [27], was a much larger and practically
motivated task established in the framework of the European Union speech-to-
speech MT project EuTrans [28]. The task involved human-to-human com-
munication situations at the front desk of a hotel. A large parallel corpus for
this task was produced in a semi-automatic way [8]. The resulting Spanish-
English⁴ corpus contained 500,000 (171,481 different) sentence pairs, with Span-
ish/English vocabulary sizes of 689/514 words and test-set bigram perplexities
of 6.8/5.6, respectively. Since the total size of this corpus was considered unreal-
istically large, a much reduced corpus, called EuTrans-I, was built by randomly
selecting 10k (6,813 different) sentence pairs for training and 3k (all different)
for testing.

Very good results in both text- and speech-input experiments with
EuTrans-0 were reported in [27], and additional results, both for EuTrans-0
and EuTrans-I, can be seen in [8]. Using a categorised⁵ version of the huge
EuTrans-0 training corpus, OSTIA-DR produced almost perfect models, with
text- and (microphone) speech-input TWER lower than 1% and 2%, respec-
tively. However, results degraded significantly when the more realistically sized
EuTrans-I training corpus was used. Since the learned models were clearly
under-trained, error-correcting smoothing was needed in this case, leading to a
text-input TWER close to 10% [28]. Without using categories, no useful results
were obtained. 



6 Hybrid Methods: OMEGA and GIATI 

OSTIA has proved able to learn adequate transducers for real (albeit limited) 
tasks if a sufficiently large amount of training pairs is available. However, as the 
amount of training data shrinks, its performance drops dramatically. Clearly, 
in order to convey enough information to learn structurally rich transducers, 
prohibitively large amounts of examples are required. Therefore, in order to 

³ TWER is a (rather pessimistic) measure computed as the minimum number of word
insertions, substitutions and deletions needed to match the system output with a
single target sentence reference.

⁴ Similar Spanish-German and Spanish-Italian corpora were also produced. For the
sake of brevity, only Spanish-English data and results will be discussed here.

⁵ Seven categories: proper names, numbers, dates and times of day, etc. Each instance
of these categories was substituted by a corresponding non-terminal symbol.
render training data demands realistic, additional, explicit information about the
relation of source-target words involved in the translation seems to be needed. 
The two techniques discussed in this section are explicitly based on this idea. 

OMEGA 

This algorithm, called “OSTIA modified for employing guarantees and align- 
ments” (OMEGA) [6], is an improvement over OSTIA-DR to learn SSTs. 

As with OSTIA, there are two main training phases: building an initial tree 
from the training pairs and state-merging. Apart from the OSTIA-DR state- 
merging restrictions, including those derived from (n-gram) source and target 
language models, two additional knowledge sources are employed to avoid over-
generalisation: a bilingual dictionary and word alignments. They are used to
ensure that target words are only produced after having seen the source words
they are translations of. These dictionaries and alignments can be obtained from
the training pairs by means of pure statistical methods such as those described
in [13,29].

To enforce the new constraints, each state is labelled with two sets: the 
“guarantees” and the “needs”. The first set indicates which words can appear in 
the output because the corresponding input has already been seen. The second 
set contains those words that should appear since they will be output somewhere 
along the target subsequences departing from the state [6].

OMEGA was tested on the EuTrans-0 and EuTrans-I corpora described 
above. While results on EuTrans-0 were similar to or slightly worse than those 
of OSTIA, on the much smaller EuTrans-I corpus OMEGA was clearly better. 
Without using categories, OMEGA achieved error-correcting text-input TWER 
better than 7% [28] whereas, under the same conditions, OSTIA completely 
failed to produce useful results. OMEGA transducers learned with EuTrans-I 
were also tested with speech-input, achieving moderately good results of less than 
13% and 18% TWER for microphone and telephone speech, respectively [15]. 
These results correspond to 336 Spanish test utterances from several speakers. 

Apart from the Spanish-English corpora considered so far, an additional
Italian-English MT corpus was produced in the EuTrans project [28]. This
corpus, referred to as EuTrans-II, corresponds to a task significantly more com-
plex and closer to real life than those previously considered. In this case, a speech
corpus was acquired by recording real phone calls to the (simulated) front desk 
of a hotel. An associated text corpus was obtained by manually transcribing 
the acquired Italian utterances and translating them into English. The resulting 
corpus is much smaller than the previous ones, while having 4 times larger vo- 
cabularies (2,459/1,701 words). From this corpus, approximately 3,000 pairs of 
sentences were used for training the translation model and about 300 sentences 
(and the corresponding utterances, from 24 speakers) were used for testing. 

OSTIA could not be used at all in this task and OMEGA produced only 
moderately acceptable results: less than 38% and 42% TWER for text and speech 
input, respectively [6,15]. This prompted the need for even less data-hungry 
transducer learning techniques, leading to the approach discussed below. 
GIATI

This approach, called grammatical inference and alignments for transducer in- 
ference (GIATI), also makes use of information obtained by means of pure sta- 
tistical methods. However, in this case, the alignments are used to define an 
adequate bilingual segmentation of the training pairs of sentences. 

Given a finite sample of string pairs, GIATI relies on the two-morphisms 
theorem mentioned in Section 3 [21], to propose the following steps for the 
inference of a transducer [22]:

1. Using a bilingual segmentation of each training pair of sentences, the pair is 
transformed into a single string from an extended alphabet. 

2. A stochastic regular grammar, typically an n-gram, is inferred from the set
of strings obtained in the first step.

3. The terminal symbols of the grammar rules from the second step are trans-
formed back into source/target symbols by applying adequate morphisms.
This converts the stochastic grammar into the learned transducer.
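
The three steps above can be sketched on a toy corpus. This is a minimal illustration rather than the GIATI implementation of [20,22]: the bilingual segmentation is assumed to be given (one source word per segment, with a possibly empty target segment), the grammar is a maximum-likelihood bigram without smoothing, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

BOS, EOS = "<s>", "</s>"


def to_extended_strings(segmented_pairs):
    """Step 1: turn each segmented pair into a string of extended symbols,
    each joining a source word with its target segment."""
    return [[(a, tuple(seg)) for a, seg in pair] for pair in segmented_pairs]


def bigram_model(strings):
    """Step 2: maximum-likelihood bigram over extended symbols."""
    counts = defaultdict(Counter)
    for z in strings:
        seq = [BOS] + z + [EOS]
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1
    return {
        h: {z: c / sum(nxt.values()) for z, c in nxt.items()}
        for h, nxt in counts.items()
    }


def to_transducer(bigrams):
    """Step 3: bigram histories become states; each extended symbol is split
    back into its source word and its target segment."""
    transitions, final = defaultdict(list), {}
    for h, nxt in bigrams.items():
        for z, p in nxt.items():
            if z == EOS:
                final[h] = p
            else:
                a, seg = z
                transitions[h].append((a, seg, z, p))
    return BOS, dict(transitions), final


# Invented segmented pairs: "una habitación" / "a room" and
# "una habitación doble" / "a double room".
corpus = [
    [("una", ["a"]), ("habitación", ["room"])],
    [("una", ["a"]), ("habitación", []), ("doble", ["double", "room"])],
]

if __name__ == "__main__":
    initial, transitions, final = to_transducer(
        bigram_model(to_extended_strings(corpus))
    )
    for state, arcs in transitions.items():
        for arc in arcs:
            print(state, "->", arc)
    print(final)
```

Because the transition probabilities are simply the bigram probabilities, any smoothing applied in step 2 carries over to the transducer unchanged.
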

The main problem with this approach is the first (and correspondingly the third) 
step(s), i.e. to adequately transform a parallel corpus into a string corpus. The 
transformation of the training pairs must capture the correspondences between 
words of the input and the output sentences and must allow the implementation 
of the inverse transformation of the third step. As previously mentioned, this is 
achieved with the help of bilingual segmentation [20].

The probabilities associated with the transitions of a SFST learned in this 
way are just those of the corresponding stochastic regular grammar inferred in 
step 2. Therefore, an interesting feature of GIATI is that it can readily make use 
of all the smoothing techniques known for n-grams [30] and stochastic regular 
grammars [25]. Other interesting properties of GIATI can be found in [21]. 

This technique was tested with all the corpora of the EuTrans project.
TWERs smaller than 7% and 25% were obtained for text input with EuTrans-I
and EuTrans-II, respectively [20]. For telephone speech input, on the other
hand, TWERs of less than 13% and 30% were obtained on the same corpora, and less than
8% for microphone speech with EuTrans-I [20,15]. These results, particularly
those of EuTrans-II, are clearly better than those achieved by OMEGA under 
the same conditions. Overall, GIATI was among the best techniques tested in 
the framework of the EuTrans project [28]. 

Recently, GIATI has also been used in the context of the computer assisted 
translation (CAT) project Trans-Type 2 (TT2) [18]. One of the corpora consid- 
ered in this project, referred to as “Xerox”, contains a collection of technical 
Xerox manuals written in English, Spanish, French and German. The sizes of 
the training subsets are around 600,000 words for each language. In this case 
performance is assessed in terms of the Key-Stroke Ratio (KSR). It measures
the percentage of keystrokes that a human translator has to type with respect
to those needed to type the entire text without the help of the CAT system.

Results are promising. For English to Spanish translation, a KSR smaller than 26%
has been achieved, while KSRs of about 30%, 54%, 54%, 60% and 54% have been
obtained for Spanish to English, English to French, French to English, English
to German and German to English, respectively [18]. Apart from these results,
GIATI has been used as the basis of one of the prototypes which have shown
better practical behaviour according to real human tests on the Xerox task [18].



7 Pure Statistical Approach: GIATI Revisited 

The original GIATI technique was developed from some properties of formal
language theory [21]; however, this technique can also be derived from a pure
statistical point of view. Such a derivation has the advantage that it no longer has
to rely on “external” statistical techniques to obtain a bilingual segmentation or
make use of heuristics to transform pairs of sentences into conventional strings.

Let J and I be the given lengths of the source and the target sentences,
respectively. Both of these sentences are assumed to be segmented into the same
number of segments, K, for all possible K. Segmentations can be described as
two functions, \(\gamma\) and \(\mu\), for the source and the target sentence, respectively:

\[ \gamma : \{1, \ldots, K\} \to \{1, \ldots, J\} \ \text{with} \ \gamma_{k+1} > \gamma_k \ \text{for} \ 1 \le k < K \ \text{and} \ \gamma_K = J , \]

\[ \mu : \{1, \ldots, K\} \to \{1, \ldots, I\} \ \text{with} \ \mu_{k+1} > \mu_k \ \text{for} \ 1 \le k < K \ \text{and} \ \mu_K = I . \]

Under these assumptions, Eq. 3 can be rewritten as⁶:

\[ \Pr(s, t) = \Pr(J, I) \cdot \sum_{K} \Pr(K \mid J, I) \cdot \sum_{\gamma_1^K, \mu_1^K} \Pr(s_1^J, t_1^I, \gamma_1^K, \mu_1^K \mid J, I, K) . \qquad (8) \]

Assuming that \(\Pr(\gamma_1^K, \mu_1^K \mid J, I, K)\) is uniform and that the correspondence among
source and target segments is one-to-one and monotone, the last term in the right
side of Eq. 8 can be rewritten as:

\[ \Pr(s_1^J, t_1^I, \gamma_1^K, \mu_1^K \mid J, I, K) \propto \prod_{k=1}^{K} \Pr(s_{\gamma_{k-1}+1}^{\gamma_k}, t_{\mu_{k-1}+1}^{\mu_k} \mid J, I, K, s_1^{\gamma_{k-1}}, t_1^{\mu_{k-1}}, \gamma_1^K, \mu_1^K) . \qquad (9) \]

A convenient type of segmentation is when \(\gamma_j = j\) for \(1 \le j \le J\); that is,
source word-by-word. Then K = J and it is not necessary to use \(\gamma_1^K\) explicitly:

\[ \Pr(s_1^J, t_1^I, \mu_1^J \mid J, I, K) \propto \prod_{k=1}^{J} \Pr(s_k, t_{\mu_{k-1}+1}^{\mu_k} \mid J, I, K, s_1^{k-1}, t_1^{\mu_{k-1}}, \mu_1^K) . \qquad (10) \]

The right part of Eq. 10 can be approximated by using n-grams of the
sequences of monotonically paired segments. This amounts to representing the
right side of the conditional probabilities, \((J, I, K, s_1^{k-1}, t_1^{\mu_{k-1}}, \mu_1^K)\), by means of
the concept of (equivalence class of) "history" (\(H(J, I, K, s_1^{k-1}, t_1^{\mu_{k-1}}, \mu_1^K)\)) [16].

6 Following a notation used in [13], a sequence of the form \(z_i, \ldots, z_j\) is denoted as
\(z_i^j\). For some positive integers N and M, the image of a function \(f : \{1, \ldots, N\} \to
\{1, \ldots, M\}\) for n is denoted as \(f_n\), and all the possible values of the function as \(f_1^N\).

With n-grams in particular and with stochastic finite-state automata in general,
the equivalence classes of histories correspond to the states of the model. Finally,
if it is assumed that Pr(J, I) and Pr(K | J, I) in Eq. 8 are single parameters that
are independent of J, I and K, Eq. 8 can be written as:

\[ \Pr(s, t) \propto \sum_{\mu_1^K} \prod_{k=1}^{J} \Pr(s_k, t_{\mu_{k-1}+1}^{\mu_k} \mid H(J, I, K, s_1^{k-1}, t_1^{\mu_{k-1}}, \mu_1^K)) . \qquad (11) \]

Given a set of training pairs, all the probabilities of Eq. 11 can be
estimated using the EM algorithm [31]. Note that the existence of hidden variables
related to the assumed bilingual segmentation makes EM estimation the method
of choice, instead of using simple relative frequency.

On the other hand, as an estimation of a joint distribution, Eq. 11 suggests
an approximate implementation as a SFST. The states of the SFST are the
equivalence classes of histories \(H(J, I, K, s_1^{k-1}, t_1^{\mu_{k-1}}, \mu_1^K)\), and the probabilities
in Eq. 11 correspond to the transition probabilities between states with \(s_k\) as
source word and \(t_{\mu_{k-1}+1}^{\mu_k}\) as target string.

The extension of this procedure to the most general setting of Eq. 9, which 
deals with source strings rather than single source words, is straightforward. 

In practice, when SFSTs are used, the finite-length characteristic of source
and target strings is modelled through the explicit introduction of a special "end"
symbol $, which is not in the source or the target language: s$ and t$, instead of
explicit length modelling (I, J). This is the usual way to represent finite-length
strings when n-gram models are used. In this case, and using n-grams, Eq. 11
becomes:

\[ \Pr(s, t) \propto \sum_{\mu_1^K} \left[ \prod_{k=1}^{J} \Pr(s_k, t_{\mu_{k-1}+1}^{\mu_k} \mid s_{k-n+1}^{k-1}, t_{\mu_{k-n}+1}^{\mu_{k-1}}) \right] \cdot \Pr(\$, \$ \mid s_{J-n+2}^{J}, t_{\mu_{J-n+1}+1}^{\mu_J}) . \qquad (12) \]

The states of the SFST are all possible \((s_{k-n+1}^{k-1}, t_{\mu_{k-n}+1}^{\mu_{k-1}})\) in the training
set⁷. As in Eq. 11, the probabilities in Eq. 12 correspond to the transition
probabilities between two states, with \(s_k\) as the source symbol and \(t_{\mu_{k-1}+1}^{\mu_k}\) as the
target string associated with the transition, and the probability that a state
is a final state is \(\Pr(\$, \$ \mid s_{k-n+1}^{k-1}, t_{\mu_{k-n}+1}^{\mu_{k-1}})\).



An important difference of this proposal, with respect to the version of GIATI 
previously described in section 6, is the number of segmentations that are rep- 
resented in the model: In the present case, all possible segmentations of the 
training set are considered, while in the previous one, only one segmentation 
was used: the one derived from the best word alignments between source and 
target strings obtained by an external statistical alignment algorithm. 
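To make concrete what "all possible segmentations" means in the word-by-word case, the following sketch enumerates every monotone pairing of source words with target segments and appends the "end" pair. It only illustrates the hidden variable being summed over; it is not the authors' implementation. The function name, the `allow_empty` switch (which additionally permits empty target segments) and the example sentences are assumptions made for this illustration.

```python
from itertools import combinations, combinations_with_replacement


def target_segmentations(source, target, allow_empty=False):
    """Yield each monotone pairing of source words with target segments.

    Each result is a list of (source_word, target_segment) pairs, closed
    by the special end pair ("$", ("$",)).
    """
    J, I = len(source), len(target)
    if J == 0:
        return
    if allow_empty:
        # Segment boundaries may repeat or be 0, so segments can be empty.
        cut_sets = combinations_with_replacement(range(0, I + 1), J - 1)
    else:
        # Strictly increasing boundaries: every segment has at least one word.
        cut_sets = combinations(range(1, I), J - 1)
    for cuts in cut_sets:
        boundaries = list(cuts) + [I]  # the last boundary is always I
        start, pairs = 0, []
        for word, end in zip(source, boundaries):
            pairs.append((word, tuple(target[start:end])))
            start = end
        pairs.append(("$", ("$",)))
        yield pairs


# Hypothetical sentence pair, used only for illustration.
for pairing in target_segmentations(["la", "casa"], ["the", "big", "house"]):
    print(pairing)
```

With a model of the form of Eq. 12, each pairing would be scored as a product of n-gram probabilities over its pairs, and the scores of all pairings would be summed.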



7 Note that, by varying K and \(\mu_1^K\), all the target segments from the training set are generated.

8 Conclusions

A number of techniques to learn stochastic finite-state transducers for machine
translation have been reviewed. The review has started with techniques mainly
falling under the traditional grammatical inference paradigm. While these
techniques have proven able to learn very adequate MT models for non-trivial tasks,
the amount of training sentence pairs needed often becomes prohibitive in
real-world situations. Other techniques are also reviewed that circumvent this
problem by increasingly relying on statistically-derived information.

As task complexity increases, we think that statistically-based learning is the 
most promising framework, particularly when data becomes scarce, which is the 
typical condition encountered in practical situations. Consequently, in the final 
part of this article, one of the reviewed techniques has been revisited so as to 
derive it using only statistical arguments. 

All the reviewed techniques have been tested on practical MT tasks consid- 
ered in many Spanish and European projects and company contracts, involving 
a large variety of languages such as Spanish, English, Italian, German, French, 
Portuguese, Catalan and Basque. As a result of these projects a number of pro- 
totypes have been implemented and successfully tested under real (or at least 
realistic) conditions. On-line demonstrations of some of these prototypes are 
available at http://prhlt.iti.es/demos/demos.htm. 

References 

1. Vidal, E., Garcia, P., Segarra, E.: Inductive learning of finite-state transducers for 
the interpretation of unidimensional objects. In Mohr, R., Pavlidis, T., Sanfeliu, 
A., eds.: Structural Pattern Analysis. World Scientific pub (1989) 17-35 

2. Knight, K., Al-Onaizan, Y.: Translation with finite-state devices. In: Proceedings
of the 4th. AMTA Conference. (1998)

3. Oncina, J., Garcia, P., Vidal, E.: Learning subsequential transducers for pattern 
recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Ma- 
chine Intelligence 15 (1993) 448-458 

4. Castellanos, A., Vidal, E., Varo, A., Oncina, J.: Language Understanding and 
Subsequential Transducer Learning. Computer Speech and Language 12 (1998) 
193-228 

5. Makinen, E.: Inferring finite transducers. Technical Report A-1999-3, University 
of Tampere (1999) 

6. Vilar, J.M.: Improve the learning of subsequential transducers by using alignments
and dictionaries. In: Grammatical Inference: Algorithms and Applications. Volume
1891 of Lecture Notes in Artificial Intelligence. Springer-Verlag (2000) 298-312

7. Casacuberta, F.: Inference of finite-state transducers by using regular grammars
and morphisms. In: Grammatical Inference: Algorithms and Applications. Volume
1891 of Lecture Notes in Computer Science. Springer-Verlag (2000) 1-14

8. Amengual, J., Benedi, J., Casacuberta, F., Castano, A., Castellanos, A., Jimenez, 
V., Llorens, D., Marzal, A., Pastor, M., Prat, F., Vidal, E., Vilar, J.: The EuTrans-I 
speech translation system. Machine Translation 15 (2000) 75-103 

9. Alshawi, H., Bangalore, S., Douglas, S.: Learning dependency translation models 
as collections of finite state head transducers. Computational Linguistics 26 (2000) 







Enrique Vidal and Francisco Casacuberta 



10. Pico, D., Casacuberta, F.: Some statistical-estimation methods for stochastic finite- 
state transducers. Machine Learning 44 (2001) 121-141 

11. Bangalore, S., Riccardi, G.: A finite-state approach to machine translation. In: 
Proceedings of the North American ACL2001, Pittsburgh, USA (2001) 

12. Casacuberta, F., Vidal, E.: Machine translation with inferred stochastic finite-state 
transducers. Computational Linguistics 30 (2004) 205-225 

13. Brown, P.F., Pietra, S.A.D., Pietra, V.J.D., Mercer, R.L.: The mathematics of 
statistical machine translation: Parameter estimation. Computational Linguistics 
19 (1993) 263-311 

14. Ney, H., Nießen, S., Och, F.J., Sawaf, H., Tillmann, C., Vogel, S.: Algorithms for
statistical translation of spoken language. IEEE Transactions on Speech and Audio 
Processing 8 (2000) 24-36 

15. Casacuberta, F., Ney, H., Och, F.J., Vidal, E., Vilar, J.M., Barrachina, S.,
García-Varea, I., Llorens, D., Martinez, C., Molau, S., Nevado, F., Pastor, M., Pico, D.,
Sanchis, A., Tillmann, C.: Some approaches to statistical and finite-state speech- 
to-speech translation. Computer Speech and Language 18 (2004) 25-47 

16. Jelinek, F.: Statistical Methods for Speech Recognition. The MIT Press, Cam- 
bridge, Massachusetts (1998) 

17. Langlais, P., Foster, G., Lapalme, G.: TransType: a computer-aided translation 
typing system. In: Proceedings of the Workshop on Embedded Machine Translation 
Systems (NAACL/ANLP2000), Seattle, Washington (2000) 46-52 

18. Civera, J., Vilar, J., Cubel, E., Lagarda, A., Casacuberta, F., Vidal, E., Pico,
D., Gonzalez, J.: A syntactic pattern recognition approach to computer assisted
translation. In Fred, A., Caelli, T., Campilho, A., Duin, R.P., de Ridder, D., eds.:
Advances in Statistical, Structural and Syntactical Pattern Recognition. Lecture
Notes in Computer Science. Springer-Verlag, Lisbon (2004)

19. Mohri, M.: Finite-state transducers in language and speech processing. Computa- 
tional Linguistics 23 (1997) 269-311 

20. Casacuberta, F., Vidal, E.: Machine translation with inferred stochastic finite-state 
transducers. Computational Linguistics 30 (2004) 205-225 

21. Casacuberta, F., Vidal, E., Pico, D.: Inference of finite-state transducers from 
regular languages. Pattern Recognition (2004) In press 

22. Casacuberta, F., de la Higuera, C.: Computational complexity of problems on
probabilistic grammars and transducers. In: Grammatical Inference: Algorithms
and Applications. Volume 1891 of Lecture Notes in Computer Science.
Springer-Verlag (2000) 15-24

23. Amengual, J., Vidal, E.: Efficient Error-Correcting Viterbi Parsing. IEEE
Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1109-1116

24. Amengual, J., Sanchis, A., Vidal, E., Benedí, J.: Language simplification through
error-correcting and grammatical inference techniques. Machine Learning 44
(2001) 143-159

25. Llorens, D., Vilar, J.M., Casacuberta, F.: Finite state language models smoothed
using n-grams. International Journal of Pattern Recognition and Artificial
Intelligence 16 (2002) 275-289

26. Oncina, J., Varó, M.: Using domain information during the learning of a
subsequential transducer. In: Grammatical Inference: Learning Syntax from Sentences.
Volume 1147 of Lecture Notes on Computer Science. (1996) 313-325

27. Vidal, E.: Finite-State Speech-to-Speech Translation. In: Proceedings of the
International Conference on Acoustics Speech and Signal Processing (ICASSP-97),
proc., Vol. 1, Munich (1997) 111-114




Learning Finite-State Models for Machine Translation 






28. EuTrans: Example-based language translation systems. Final report. Technical
report, Instituto Tecnologico de Informatica, Fondazione Ugo Bordoni, Rheinisch
Westfälische Technische Hochschule Aachen Lehrstuhl für Informatik VI, Zeres
GmbH Bochum: Long Term Research Domain, Project Number 30268 (2000)

29. Och, F., Ney, H.: A systematic comparison of various statistical alignment models.
Computational Linguistics 29 (2003) 19-51

30. Ney, H., Martin, S., Wessel, F.: Statistical language modeling using
leaving-one-out. In Young, S., Bloothooft, G., eds.: Corpus-Based Statistical Methods
in Speech and Language Processing. Kluwer Academic Publishers (1997) 174-207

31. Moon, T.K.: The expectation-maximization algorithm. IEEE Signal Processing
Magazine (1996) 47-59




The Omphalos Context-Free Grammar 
Learning Competition 



Bradford Starkie 1 , François Coste 2 , and Menno van Zaanen 3
1 Telstra Research Laboratories
770 Blackburn Rd Clayton, Melbourne, Victoria, Australia 3127
Brad.Starkie@telstra.com.au
http://www.cs.newcastle.edu.au/~bstarkie
2 IRISA, Campus de Beaulieu, 35042 Rennes, France
francois.coste@irisa.fr
http://www.irisa.fr/symbiose/people/coste
3 Tilburg University, Postbus 90153, 5000LE, Tilburg, The Netherlands
mvzaanen@uvt.nl
http://ilk.uvt.nl/~mvzaanen



Abstract. This paper describes the Omphalos Context-Free Grammar 
Learning Competition held as part of the International Colloquium on 
Grammatical Inference 2004. The competition was created in an effort to 
promote the development of new and better grammatical inference algo- 
rithms for context-free languages, to provide a forum for the comparison 
of different grammatical inference algorithms and to gain insight into the 
current state-of-the-art of context-free grammatical inference algorithms. 

This paper discusses design issues and decisions made when creating the 
competition. It also includes a new measure of the complexity of inferring 
context-free grammars, used to rank the competition problems. 

1 Introduction 

Omphalos is a context-free language learning competition held in conjunction 
with the International Colloquium on Grammatical Inference 2004 1 . The aims 
of the competition are: 

— to promote the development of new and better grammatical inference (GI) 
algorithms, 

— to provide a forum in which the performance of different grammatical infer- 
ence algorithms can be compared on a given task, and 

— to provide an indicative measure of the complexity of grammatical inference 
problems that can be solved with the current state-of-the-art techniques. 

2 The Competition Task 

The competition task was to infer a model of a context-free language from un- 
structured examples. During the development of the competition, two main de- 
sign issues needed to be resolved as follows: 

1 The competition data will continue to be available after the completion of the
competition and can be accessed via http://www.irisa.fr/Omphalos/.

G. Paliouras and Y. Sakakibara (Eds.): ICGI 2004, LNAI 3264, pp. 16-27, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004

Method of evaluation. A method of determining the winner of the competition
needed to be decided upon. For instance, entries could be judged by
measuring the difference between the inferred language and the target
language, or, alternatively, entries could be judged by measuring the difference
between the derivation trees assigned to sentences by the inferred grammar
and the derivation trees assigned to sentences by the target grammar.

Complexity of tasks. As one of the goals of the competition was to determine
the state-of-the-art in grammatical inference, the competition tasks should
be selected so as to be neither too simple, nor too difficult to solve. That is,
the complexity of the learning task should be quantifiable.

Both of these issues are discussed in the following subsections.

2.1 Method of Evaluation 

The evaluation approach selected to be used by the competition should be
automatic, objective and easy. Van Zaanen et al. [1] describe several approaches to
the evaluation of grammatical inference systems. These include the following:

Rebuilding known grammars. Using a pre-defined grammar, unstructured 
data is generated. Based on this data, the GI system tries to induce the 
original grammar. The problem here is that for most languages, there is 
more than one grammar that can be used to describe that language. Al- 
though all regular grammars can be transformed into a canonical form, no 
such canonical form exists for context-free grammars. Therefore there is no 
automated way to determine that the language described by a grammar 
submitted by a competitor and the target grammar are the same. 

Comparison of labelled test data with treebank. Plain data is extracted 
from a set of structured sentences. A GI system must now find the original 
structure using the unstructured data. A distance measure is then used to 
rank the similarity between the derivation trees assigned to test sentences by 
the inferred grammar and derivation trees assigned to test sentences by the 
target grammar. Once again, because for most languages there is more than
one grammar describing that language, a grammar submitted by a competitor
may describe the target language exactly, but would rank poorly according
to the distance measure. In addition, the approach is problematic when
ambiguity occurs in either the target or inferred grammar, as multiple solutions
are then possible.

Classification of unseen examples. The GI systems receive unstructured 
(positive or positive and negative) training data. This training data is gen- 
erated according to an underlying grammar. Next, test data is provided 
and the system designed by competitors should assign language member- 
ship information to this test data. In other words, the system must say for 
each sentence if it is contained in the language described by the underlying 
grammar. The main disadvantage of this evaluation approach is that since
this task is a classification task, no real grammatical information has to be
learned as such.

Precision and recall. Each GI system receives unstructured training data
generated according to a target grammar. From this training data a grammar is
inferred. Next, previously unseen sentences that are in the target language
are provided to the system. The percentage of these sentences that are in
the inferred grammar is measured. This measure is known as the recall. Next,
sentences are generated according to the inferred grammar. The number of
these sentences that exist in the target language is then measured. This
measure is known as the precision. Recall and precision can then be merged
into a single measure known as the F-score. The F-score is a measure of
the similarity between the target and inferred languages. A problem with
this technique is that once a test sentence has been used to measure recall
it cannot be used to measure recall a second time. This is because that test
sentence can also be used as an additional training example. In addition,
this technique requires that all grammatical inference systems be capable of
generating example sentences.
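The precision and recall procedure can be sketched on a toy example. The two membership tests below (a target language of strings a^n b^n and an inferred regular approximation a+b+), the sample sentences and the function names are all hypothetical, and the F-score is assumed to be the usual balanced harmonic mean, which the text does not specify.

```python
import re


def target_accepts(s: str) -> bool:
    """Toy target language: a^n b^n with n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == "a" * n + "b" * n


def inferred_accepts(s: str) -> bool:
    """Toy inferred language: the regular approximation a+b+."""
    return re.fullmatch(r"a+b+", s) is not None


def precision_recall_f(target_sample, inferred_sample):
    """Recall uses sentences drawn from the target language; precision
    uses sentences generated by the inferred grammar."""
    recall = sum(map(inferred_accepts, target_sample)) / len(target_sample)
    precision = sum(map(target_accepts, inferred_sample)) / len(inferred_sample)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


p, r, f = precision_recall_f(
    target_sample=["ab", "aabb", "aaabbb"],        # unseen target sentences
    inferred_sample=["ab", "aab", "abb", "aabb"],  # generated by the inferred grammar
)
print(p, r, f)
```

In this toy case the over-general inferred language has perfect recall but only half of its generated sentences belong to the target language.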

All these evaluation methods have problems when applied to a generic GI 
competition. It was decided however that for the Omphalos competition, the 
technique of “classification of unseen examples” would be used to identify the 
winner of the competition. This was the same evaluation method that was used 
for the Abbadingo [2] DFA (regular language) learning competition. For Omphalos,
however, competitors needed to tag the test sentences with 100% accuracy,
compared with 99% accuracy for the Abbadingo competition. This stricter
requirement was used in an effort to encourage the development of new, truly
context-free learning algorithms. The main benefit of this technique is that it
places few restrictions upon the techniques used to classify the data. In addi- 
tion if the inferred grammar describes the target language exactly it will classify 
the test examples exactly. The main disadvantage of this technique is that it is 
possible for a classifier to classify all of the test examples exactly without the 
classifier having an accurate model of the target language. One way to overcome 
this problem is to only consider the problem to be solved when the precision 
of the inferred grammar is greater than a threshold value. We have decided not 
to implement this additional constraint, since it is believed that if the test sets 
contain negative examples that are sufficiently close to positive examples, then 
classification accuracy is a suitable measure of how close the inferred grammar 
is to the target grammar. 

2.2 Complexity of the Competition Tasks 

The target grammars, training and testing sentences were created with the fol- 
lowing objectives in mind: 

Requirement 1. The learning task should be sufficiently difficult. Specifically 
the task should be just outside of the current state-of-the-art, but not so 
difficult that it is unlikely that a winner will be found. 

Requirement 2. It should be provable that the training sentences are sufficient
to identify the target language.

From [3] it is known that it is impossible to learn a context-free grammar from
positive examples only without reference to a statistical distribution, however:

— It is possible to learn a context-free grammar if both positive and negative 
examples are available and, 

— If sufficient additional prior knowledge is known, such as a statistical dis- 
tribution, it is possible to learn a context-free grammar with positive data 
only. 

Therefore it was decided to include problems that included learning gram- 
mars from positive examples only as well as learning grammars from positive as 
well as negative examples. 

To determine whether or not Requirement 1 was met, a measure of the
complexity of the learning task was derived. This measure was derived by creating
a model of the learning task based upon a brute force search. To do this a
hypothetical grammatical inference algorithm called the BruteForceLearner
algorithm was created. This model was also used to determine if the training data
was sufficient to identify the target grammar. The details of the algorithm, the
proof that it can identify any context-free grammar in the limit from positive
and negative data, as well as the definition of the complexity measure itself can
be found in [4].

The summary of these proofs is as follows: 

— For each context-free grammar G there exists a set of positive examples
O(G) such that when O(G) is presented to the BruteForceLearner algorithm,
the BruteForceLearner algorithm constructs a set of grammars X such that
there exists a grammar G_2 in X with the property that L(G) = L(G_2). We
call this set the characteristic set, and use the notation O(G) to define the
characteristic set of G.

— Given G there exists a simple technique to construct O(G). This technique
involves generating a sentence s for each rule P in G, such that all
derivations of s are derived using P. This technique was used in the Omphalos
competition to construct some of the training examples.

— When presented with the training examples O(G) the BruteForceLearner
need only construct a finite number of candidate grammars. Equation 1
described below defines the number of candidate grammars that could be
constructed that would be sufficient to include the target language.

\[ \#\text{candidate grammars} = 2^{\left(\left(\sum_j (2|O_j| - 2)\right) + 1\right)^3 + \left(\left(\sum_j (2|O_j| - 2)\right) + 1\right) T(O)} \qquad (1) \]

Where G is any context-free grammar, O is a set of positive examples in a
characteristic set of G, and T(O) is the number of terminals in O.

— Given G and O(G), there exists an additional set of positive and negative
examples O_2(G) such that when O_2(G) is presented to the BruteForceLearner
algorithm after O(G), the BruteForceLearner algorithm identifies the target
language exactly.

The following technique can be used to construct O_2(G):

— Given O(G) construct the set of hypothesis grammars H that is sufficiently
large to ensure that it contains the target grammar.

— For each H_i ∈ H such that L(H_i) ⊂ L(G) add one sentence, marked as a
positive example, to O_2(G) that is an element of L(G) but not an element of
L(H_i).

— For all other H_i ∈ H where L(H_i) ⊄ L(G) add one sentence, marked as a
negative example, to O_2(G) that is an element of L(H_i) but not an element
of L(G).

Note that this is the only known technique for constructing the set O_2(G). The
number of possible grammars given O(G) is described by Equation 1, which is not
polynomial. Therefore the construction of the negative data using this technique
is not computable in polynomial time.
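The size of the hypothesis space can be illustrated numerically. The sketch below assumes that Equation 1 has the form 2^(N^3 + N·T(O)) with N = (Σ_j (2|O_j| − 2)) + 1, where |O_j| is the length of the j-th example and T(O) counts the distinct terminal symbols in O. This reading of the equation, the function name and the toy examples are assumptions; the definitive form is given in [4]. The function returns the base-2 logarithm, since the count itself is astronomically large.

```python
def log2_candidate_grammars(examples):
    """Return log2 of the assumed candidate-grammar count.

    examples: the positive examples O, each a list of terminal symbols.
    """
    # N: assumed bound on the number of non-terminals needed.
    n = sum(2 * len(sentence) - 2 for sentence in examples) + 1
    # T(O): number of distinct terminals occurring in the examples.
    t = len({symbol for sentence in examples for symbol in sentence})
    # n**3 counts the possible binary rules, n*t the possible terminal
    # rules, and every subset of those rules is one candidate grammar.
    return n ** 3 + n * t


# Hypothetical characteristic set for a tiny grammar.
toy_examples = [["a", "b"], ["a", "a", "b", "b"]]
print(log2_candidate_grammars(toy_examples))
```

Even for these two short sentences the logarithm is in the hundreds, which illustrates why enumerating the hypothesis space is impractical.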

3 Creation of the Example Problems 

As a result of the proofs contained in [4] and summarized in the previous sec- 
tion, Equation 1 was considered to be a suitable measure of the complexity of 
the learning tasks of the Omphalos competition. This is because it defines a hy- 
pothesis space used by at least one algorithm that is guaranteed to identify any 
context-free language in the limit using positive and negative data. Equation 1 
was also used to benchmark the target problems against other grammatical infer- 
ence problems that were known to be solved using other algorithms. In addition 
the proofs contained in [4] showed that for all grammars, other than those that 
could generate all strings of a given alphabet, the BruteForceLearner algorithm 
required negative data to ensure that it uniquely identified any context-free lan- 
guage. As described in the previous section, the only known way that could be 
used to construct sufficient negative data to ensure that at least one known algo- 
rithm could identify the language exactly from positive data was not computable 
in polynomial time. Therefore if a target grammar was chosen that was small 
enough to enable sufficient negative training examples to be calculated, then the 
learning task would become too simple. Therefore no such set of negative data 
was calculated, and it is not known if for any of the competition problems the 
training examples are sufficient to uniquely identify the target language. 

The following technique was used to construct the training sets for the Om- 
phalos competition: 

— For each target grammar a set of positive sentences was constructed, such
that for every rule in the target grammar, a positive sentence was added to
the training set that is derived using that rule.

— A set of positive examples was then randomly created from the target
grammar, of length up to five symbols longer than the longest sentence in the
characteristic set.

— A set of negative sentences was then created for each target grammar. For
problems 1 to 6 these were constructed by randomly creating strings up to
the maximum length using the symbols of the grammar. For problems 6.1
to 10 the negative examples were created from "surrogate" grammars such
as regular approximations to the target languages.

The number of training examples was selected to be between 10 and 20 times
as large as the characteristic set.
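The random parts of this construction can be sketched as follows. This is an illustration under stated assumptions rather than the generator used for the competition: the toy grammar, the membership test `is_anbn`, the function names and all parameter values are hypothetical, rules are assumed to have no empty right-hand sides, and random strings that happen to belong to the language are assumed to be filtered out of the negatives.

```python
import random


def sample_sentence(rules, start, max_len, rng, max_tries=1000):
    """Randomly derive a sentence of at most max_len terminals, or None."""
    for _ in range(max_tries):
        form = [start]
        # Expand the leftmost non-terminal until none remain or it gets too long.
        while any(sym in rules for sym in form) and len(form) <= max_len:
            i = next(k for k, sym in enumerate(form) if sym in rules)
            form[i:i + 1] = rng.choice(rules[form[i]])
        if len(form) <= max_len and not any(sym in rules for sym in form):
            return tuple(form)
    return None


def random_negatives(terminals, in_language, max_len, count, rng):
    """Draw random strings over the terminals that are not in the language."""
    negatives = set()
    while len(negatives) < count:
        length = rng.randint(1, max_len)
        s = tuple(rng.choice(terminals) for _ in range(length))
        if not in_language(s):
            negatives.add(s)
    return sorted(negatives)


def is_anbn(s):
    """Membership test for the toy language a^n b^n, n >= 1."""
    n = len(s) // 2
    return n >= 1 and s == ("a",) * n + ("b",) * n


toy_rules = {"S": [["a", "S", "b"], ["a", "b"]]}  # hypothetical target grammar
rng = random.Random(0)
positives = set()
for _ in range(20):
    sentence = sample_sentence(toy_rules, "S", 8, rng)
    if sentence is not None:
        positives.add(sentence)
negatives = random_negatives(["a", "b"], is_anbn, 8, 10, rng)
print(sorted(positives), negatives[:3])
```

Negatives drawn uniformly at random tend to be easy to tell apart from positive sentences, which is presumably why the later problems used surrogate grammars instead.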

3.1 Creation of the Target Grammars 

Firstly the literature was reviewed to identify pre-existing benchmark gram- 
matical inference problems. The work in [5] and [6] identified some benchmark 
problems, i.e. grammars that can be used as some sort of standard to test the 
effectiveness of a GI system by trying to learn these grammars. The grammars 
were taken from [7] and [8]. Using Equation 1 the complexities of these grammars 
were calculated. A description of these grammars and their complexity measures 
are listed in Table 1. 

Table 1. Complexity of Benchmark Inference Problems from [5] and [6].

n   Description                        Example phrase                                           log2 compl.   Properties
1   (aa)*                              aa, aaaa, aaaaaa                                         [illegible]   Regular
2   (ab)*                              ab, abab, ababab                                         [illegible]   Regular
3   Operator precedence (small)        (a+(b+a))                                                [illegible]   Not regular
4   Parentheses                        (), (()), ()(), ()(())                                   [illegible]   Not regular
5   English verb agreement (small)     that is a woman, i am there                              [illegible]   Finite
6   English lzero grammar              a circle touches a square, a square is below a triangle  [illegible]   Finite
7   English with clauses (small)       a cat saw a mouse that saw a cat                         [illegible]   Not regular
8   English conjugations (small)       the big old cat heard the mouse                          [illegible]   Regular
9   Regular expressions                ab*(a)*                                                  [illegible]   Not regular
10  {ω = ω^R, ω ∈ {a,b}{a,b}+}         aaa, ba                                                  6.90 x 10^4   Not regular
11  Number of a's = number of b's      aabbaa                                                   4.29 x 10^4   Not regular
12  Number of a's = 2 x number of b's  aab, babaaa                                              [illegible]   Not regular
13  {ωω | ω ∈ {a,b}+}                  aba, aa                                                  9.12 x 10^4   Regular
14  Palindrome with end delimiter      aabb$, ab$, baab$                                        [illegible]   Not regular
15  Palindrome with center mark        aca, abcba                                               [illegible]   Not regular
16  Even length palindrome             aa, abba                                                 [illegible]   Not regular
17  Shape grammar                      da, bada                                                 [illegible]   Not regular


Using the results of Table 1 the complexities of the target grammars of the 
competition problems were selected. The grammars were then created as follows: 

1. The number of non-terminals, terminals and rules were selected to be greater 
than in grammars shown in Table 1. 

2. A set of terminals and non-terminals were created. Rules were then created
by randomly selecting terminals and non-terminals. A fixed number of rules
were created to contain only terminal strings.

3. Useless rules were then identified. If a non-terminal could not generate a
terminal string, a terminal rule was added to it. If a non-terminal was not
reachable from the start symbol, rules were added to ensure the non-terminal
was reachable from the start symbol. For instance, if the non-terminal N was
unreachable from the start symbol, a rule was created with the start symbol
on the left hand side of the rule, and N on the right hand side of the rule.

4. Additional rules were added to ensure that the grammar did not represent a 
regular language. Specifically rules containing center recursion were added. 

5. A characteristic set of sentences was generated for the grammar. If the com- 
plexity of the grammar was not in the desired range, then the grammar was 
deleted. 
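The repair of useless rules in step 3 can be sketched as below. The grammar encoding (a dictionary from each non-terminal to its list of right-hand sides), the function names and the filler terminal are assumptions made for illustration. The sketch covers only step 3 and does not reproduce how the competition grammars were actually generated.

```python
def generating_nonterminals(rules):
    """Non-terminals that can derive at least one terminal string."""
    generating, changed = set(), True
    while changed:
        changed = False
        for lhs, alternatives in rules.items():
            if lhs in generating:
                continue
            for rhs in alternatives:
                # A symbol that is not a key of rules is a terminal.
                if all(s not in rules or s in generating for s in rhs):
                    generating.add(lhs)
                    changed = True
                    break
    return generating


def reachable_nonterminals(rules, start):
    """Non-terminals reachable from the start symbol."""
    reachable, stack = {start}, [start]
    while stack:
        for rhs in rules[stack.pop()]:
            for s in rhs:
                if s in rules and s not in reachable:
                    reachable.add(s)
                    stack.append(s)
    return reachable


def repair(rules, start, filler_terminal):
    """Return a copy of the grammar with no useless non-terminals."""
    fixed = {lhs: [list(r) for r in alts] for lhs, alts in rules.items()}
    while True:
        # Give one non-generating non-terminal a terminal rule, then recheck.
        missing = sorted(set(fixed) - generating_nonterminals(fixed))
        if not missing:
            break
        fixed[missing[0]].append([filler_terminal])
    for nt in sorted(set(fixed) - reachable_nonterminals(fixed, start)):
        fixed[start].append([nt])  # start symbol on the left, N on the right
    return fixed


# Hypothetical grammar: A cannot terminate and B is unreachable from S.
toy = {"S": [["a", "A"]], "A": [["A", "b"]], "B": [["c"]]}
print(repair(toy, "S", "x"))
```

After the repair every non-terminal both derives a terminal string and is reachable from the start symbol.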

Using this technique six grammars were created as listed in Table 2. Tests
were undertaken to ensure that grammars 1-4 represented deterministic
languages. Specifically, LR(1) parse tables were constructed from the grammars
using bison. To ensure that grammars 5 and 6 represented non-deterministic
languages, rules were added to the target grammars. It should be noted that
these grammars are complex enough that they cannot be learned using a brute
force technique in time to be entered into the Omphalos competition. Having
said that, even the smallest of grammars could not be inferred using the
BruteForceLearner. After problem 6 was solved, problems 7 to 10 were added to the
competition. Problems 6.1 to 6.6 were added some time later.

The grammars listed in Table 2 represent three axes of difficulty in gram- 
matical inference. Specifically: 

1. The complexity of the underlying grammar, 

2. whether or not negative data is available and, 



3. how similar the negative examples in the test set are to positive examples
of the language. For instance whether or not the test set includes sentences
that can be generated by regular approximations to the target language but
not the target language itself.

Table 2. Complexities of Benchmark Inference Problems in Omphalos Competition.

Problem  Training data          Properties                      log2 compl.
1        Positive and negative  Not regular, deterministic      1.10 x 10^9
2        Positive only          Not regular, deterministic      7.12 x 10^8
3        Positive and negative  Not regular, deterministic      1.65 x 10^10
4        Positive only          Not regular, deterministic      1.13 x 10^10
5        Positive and negative  Not regular, non-deterministic  5.46 x 10^10
6        Positive only          Not regular, non-deterministic  6.55 x 10^10
6.1      Positive and negative  Not regular, deterministic      1.10 x 10^9
6.2      Positive only          Not regular, deterministic      7.12 x 10^8
6.3      Positive and negative  Not regular, deterministic      1.65 x 10^10
6.4      Positive only          Not regular, deterministic      1.13 x 10^10
6.5      Positive and negative  Not regular, non-deterministic  5.46 x 10^10
6.6      Positive only          Not regular, non-deterministic  6.55 x 10^10
7        Positive and negative  Not regular, deterministic      5.88 x 10^11
8        Positive only          Not regular, deterministic      1.63 x 10^11
9        Positive and negative  Not regular, non-deterministic  1.08 x 10^12
10       Positive only          Not regular, non-deterministic  9.92 x 10^11

The competition adopted a linear ordering for the benchmark problems based 
upon these axes. Correctly labelling a test set in which the negative sentences 
closely resembled the positive sentences was ranked higher than correctly la- 
belling a test set where the negative examples differed greatly from the positive 
examples. For instance, problem 6.1 is ranked higher than problem 1 and even 
problem 6. Similarly, solving a problem with a higher complexity measure was 
ranked higher than solving one with a lower complexity measure. For instance, 
problem 3 is ranked higher than problem 1. Solving a problem without using 
negative data was considered to be a more difficult problem than when negative 
data was used. For instance, problem 2 is ranked higher than problem 1. In ad- 
dition it has been noted by [6] that the inference of non-deterministic languages 
is a more difficult task than the inference of deterministic languages. Therefore 
solving those problems that involved non-deterministic languages was ranked 
higher than solving those problems that involved deterministic languages. 



3.2 Construction of Training and Testing Sets 

Once the target grammars were constructed, characteristic sets were constructed 
for each grammar. Sets of positive examples were then created using the 
GenRGenS software [9]. 

For the first six problems additional examples were then created by randomly 
generating sentences of length up to five symbols more than the length of the 
longest sentence in the characteristic set. These sentences were then parsed using 
the target grammar and were labeled as being either in or out of the target 
language. This set of sentences was then randomized and split into testing and 
training sets, but in such a way as to ensure that the training set contained a 
characteristic set. For those problems that were to be learned from positive 
data only, the training sets had all negative examples removed. 
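
The construction just described for the first six problems can be sketched as follows. This is only an illustrative sketch, not the organisers' software: the toy target language a^n b^n, the membership test `in_language`, the function `build_sets` and all of its parameters are assumptions introduced here, whereas the competition labelled sentences by parsing them with the target grammars.

```python
import random


def in_language(sentence):
    """Toy membership test (assumption): the language a^n b^n with n >= 1."""
    n = len(sentence) // 2
    return n >= 1 and sentence == "a" * n + "b" * n


def build_sets(characteristic, alphabet, n_random, extra_len=5,
               test_fraction=0.5, positive_only=False, seed=0):
    """Return (train, test) lists of (sentence, label) pairs."""
    rng = random.Random(seed)
    # Random sentences may be up to extra_len symbols longer than the longest
    # sentence in the characteristic set.
    max_len = max(len(s) for s in characteristic) + extra_len
    pool = set()
    for _ in range(n_random):
        length = rng.randint(1, max_len)
        pool.add("".join(rng.choice(alphabet) for _ in range(length)))
    pool -= set(characteristic)
    # Label every random sentence as in or out of the target language.
    labelled = [(s, in_language(s)) for s in sorted(pool)]
    rng.shuffle(labelled)
    cut = int(len(labelled) * test_fraction)
    test, train = labelled[:cut], labelled[cut:]
    # The characteristic set is always placed in the training set.
    train += [(s, True) for s in characteristic]
    if positive_only:
        train = [(s, label) for s, label in train if label]
    return train, test
```

Calling `build_sets(["ab", "aabb"], "ab", 200)` yields a labelled training set that contains both characteristic sentences; with `positive_only=True` the negative training examples are dropped while the test set keeps both classes.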

For problems 6.1 to 10 a more rigorous method of constructing negative data 
was used as follows: 

— For each context-free grammar an equivalent regular grammar was con- 
structed using the superset approximation method based on Recursive Transition 
Networks (RTN) described in [10]. Sentences that could be generated from 
this regular approximation to the target language were included as negative 
data. These sentences were included to distinguish between competitors who 
had created regular approximations to the underlying context-free languages, 
and competitors who had identified a non-regular language. 

— A context-free grammar that was larger than the target language was con- 
structed by treating the target grammar as a string rewriting system, and 
normalizing the right hand sides of rules using the normalization algorithm 
described in [11]. That is, a context-free grammar was constructed that de- 
scribed a language that was a superset of the target language, but in which 
the right hand side of each rule could not be parsed by the right hand side of 
any other rule. Negative examples were then created from this approximation 
to the target language. 

— Each target grammar in the competition included some deliberate constructs 
designed to trick grammatical inference algorithms. For instance most 
included sequences that were identical to center recursion expanded to a 
finite depth. An example is $A^n B^n$ where $n < m$ and $m$ is an integer > 1. 
To ensure that the training and testing examples tested the ability of the 
inferred grammars to capture these nuances, the target grammars were copied 
and hand modified, changing the $A^n B^n$ where $n < m$ to become $A^n B^n$ 
where $n > 1$. In addition, where center recursion of the form $A^n B^n$ 
existed in the target grammar, the regular approximations $A^* B^*$ were 
included in the “tricky” approximation to the target grammar. Negative 
examples were then created from these approximations to the target grammar 
and added to the test set. 

— There were an equal number of positive and negative examples in the test 
sets. 

In addition for problems 6.1 to 10: 

— The longest training example was shorter than the longest test example. 

— The grammar rules for problems 7 to 10 were shorter than for problems 1 to 
6.6 and had more recursion. Some non-LL(1) constructs were also added to 
the target grammars for problems 7 to 10. 
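
The idea of drawing negative examples from a regular approximation of the target can be illustrated with a toy case. The sketch below is an assumption-laden illustration, not the RTN method of [10]: it takes the language a^n b^n as a stand-in target and its regular superset a*b* as the approximation, and the function names are hypothetical.

```python
def in_target(sentence):
    """Toy target language (assumption): a^n b^n with n >= 1."""
    n = len(sentence) // 2
    return n >= 1 and sentence == "a" * n + "b" * n


def regular_approximation(max_len):
    """Enumerate a*b* up to max_len symbols, a regular superset of the target."""
    return ["a" * i + "b" * j
            for i in range(max_len + 1)
            for j in range(max_len + 1 - i)]


def tricky_negatives(max_len):
    """Sentences accepted by the regular approximation but not by the target."""
    return [s for s in regular_approximation(max_len) if not in_target(s)]
```

A learner that has only identified a*b* accepts every sentence returned by `tricky_negatives`, for example `aab`, and therefore mislabels all of them, whereas a learner that has captured the center recursion rejects them.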

4 Preliminary Results 

The timetable of the competition was constructed such that the competition 
ended two weeks prior to the ICGI 2004 conference in which this paper appears. 
Due to the deadlines involved in publishing the proceedings the results of the 
competition cannot be contained within this paper. Table 3 lists some 
important dates on the time line of the competition. 

4.1 Problem 1 

Problem 1 was solved by Joan Andreu Sanchez from the Departament de Sis- 
temes Informatics i Computacio, Universitat Politecnica de Valencia. Although 
Sanchez originally tried to solve the problem using the Inside-Outside algorithm, 
he actually solved it manually. After discovering the regularities in the positive 
examples he used a regular expression constructed by hand to classify the test 
examples as being either positive or negative. Although the target grammar was 
not a regular language the test sets did not include a single non-regular example. 
This, in addition to the speed with which the problem was solved, suggests that the 
first task was overly simple, and the negative examples were too different from 
the positive examples to be an accurate test of whether or not the language had 
been successfully learned. 

Table 3. Competition time line. 

Date                Event
February 15th 2004  Competition begins
[illegible]         Problem 1 solved by Joan Andreu Sanchez
March 22nd 2004     Problems 3, 4, 5 and 6 were solved by Erik Tjong Kim Sang
April 14th 2004     Problems 7, 8, 9 and 10 were added
June 7th 2004       New larger testing sets were added for problems 1 to 6
October 1st 2004    Competition closed
October 11th 2004   Competition winner announced
[illegible]         Omphalos session at ICGI-2004

4.2 Problems 3, 4, 5, and 6 

Problems 3, 4, 5, and 6 were solved by Erik Tjong Kim Sang from the CNTS 
- Language Technology Group at the University of Antwerp in Belgium. Tjong 
Kim Sang used a pattern matching system that classifies strings based on n- 
grams of characters that appear either only in the positive examples or only in 
the negative examples. With the exception of problem 1 this technique was not 
sufficient to solve the problem, so Tjong Kim Sang generated his own negative 
data, using the principle that the majority of randomly generated strings would 
not be contained within the language. His software behaved as follows: 

1. Firstly it loaded in positive examples from the training file. 

2. It then generated an equal number of unseen random strings, and added 
these to the training data as negative examples. 

3. An n-gram classifier was then created as follows: a count was made of 
n-grams of length 2 to 10 that appeared uniquely in the positive examples 
or uniquely in the negative examples. A weighted (frequency) count of such 
n-grams in the test strings was then made. For each sentence in the test 
set, if the positive count was larger than the negative count then the string 
was classified as positive, otherwise it was classified as negative. If both 
counts for a string were zero then that sentence was classified as unknown. 

4. Steps 2 and 3 were repeated thirty times. 

5. Strings that were classified as positive in all thirty tests were then 
assumed to be positive examples. Strings that were classified as negative 
one or more times were classified as negative. Other strings were classified 
as unknown. 
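
The five steps above can be sketched in a few lines of code. This is a reconstruction from the description only, not Tjong Kim Sang's software: the function names, the representation of sentences as tuples of symbols, the way random strings are drawn and the fixed seed are all assumptions.

```python
import random
from collections import Counter


def ngrams(sentence, n_min=2, n_max=10):
    """Yield every contiguous n-gram of the sentence with n_min <= n <= n_max."""
    for n in range(n_min, n_max + 1):
        for i in range(len(sentence) - n + 1):
            yield tuple(sentence[i:i + n])


def train(positives, negatives):
    """Frequency tables of n-grams that occur in only one of the two classes."""
    pos, neg = Counter(), Counter()
    for s in positives:
        pos.update(ngrams(s))
    for s in negatives:
        neg.update(ngrams(s))
    only_pos = {g: c for g, c in pos.items() if g not in neg}
    only_neg = {g: c for g, c in neg.items() if g not in pos}
    return only_pos, only_neg


def classify(sentence, only_pos, only_neg):
    """Compare the weighted counts of class-specific n-grams in one sentence."""
    p = sum(only_pos.get(g, 0) for g in ngrams(sentence))
    q = sum(only_neg.get(g, 0) for g in ngrams(sentence))
    if p == 0 and q == 0:
        return "unknown"
    return "positive" if p > q else "negative"


def random_negatives(positives, alphabet, rng):
    """Draw as many unseen random strings as there are positives (step 2).

    Assumes at least one unseen string exists for the sampled lengths.
    """
    seen = {tuple(s) for s in positives}
    lengths = [len(s) for s in positives]
    out = []
    while len(out) < len(positives):
        cand = tuple(rng.choice(alphabet) for _ in range(rng.choice(lengths)))
        if cand not in seen:
            out.append(cand)
    return out


def vote(test, positives, alphabet, rounds=30, seed=0):
    """Repeat steps 2 and 3 and combine the per-round labels (steps 4 and 5)."""
    rng = random.Random(seed)
    history = {tuple(s): [] for s in test}
    for _ in range(rounds):
        negatives = random_negatives(positives, alphabet, rng)
        only_pos, only_neg = train(positives, negatives)
        for s in history:
            history[s].append(classify(s, only_pos, only_neg))
    result = {}
    for s, labels in history.items():
        if all(label == "positive" for label in labels):
            result[s] = "positive"
        elif "negative" in labels:
            result[s] = "negative"
        else:
            result[s] = "unknown"
    return result
```

Because the classifier only looks at substrings of bounded length, it can succeed without representing any recursive structure, which is consistent with the observation that the early test sets did not require a context-free hypothesis.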

The techniques used by Sanchez and Tjong Kim Sang suggested that more 
effort was required to generate negative data, to ensure that the testing sets 
were accurate indications of whether or not the competitor had successfully 
constructed a context-free grammar that was close to the exact solution. In 
particular the testing sets needed to include negative sentences that were in 
regular approximations of the target language, but not in the target language 
itself. As a result, additional problems were added to the competition on April 
14th. In addition, on June 7th additional test sets for problems 1 to 6 were 
added to the competition. Because the correct classification of these test sets 
was a more difficult task than the correct classification of the earlier test sets, 
these test sets became problems 6.1 to 6.6. 

5 Conclusions 

In conclusion, at the time of the writing of this paper the competition is yet to 
achieve the goal of encouraging the development of new grammatical inference 
algorithms that can infer truly context-free languages. We believe there are two 
reasons for this. Firstly, generic machine learning classifiers have been used to 
solve the “easy” problems, so GI researchers do not attempt to re-solve these. 
Secondly, the Omphalos problems were designed to be just out of reach of the 
current state-of-the-art. Since the data-sets will stay available for some time, we 
expect these problems to be solved in the near future. The goal of providing an 
indicative measure of the complexity of grammatical inference problems that can 
be solved using current state of the art techniques has however been partially 
achieved. An equation (Equation 1) has been developed that defines the size of 
a set of context-free grammars that can be constructed from a set of training 
sentences, such that the target language is guaranteed to be contained in this 
set of context-free grammars. 

Abstract. State Merging algorithms, such as Rodney Price’s EDSM 
(Evidence-Driven State Merging) algorithm, have been reasonably suc- 
cessful at solving DFA-learning problems. EDSM, however, often does 
not converge to the target DFA and, in the case of sparse training data, 
does not converge at all. In this paper we argue that this is partially due 
to the particular heuristic used in EDSM and also to the greedy search 
strategy employed in EDSM. We then propose a new heuristic that is 
based on minimising the risk involved in making merges. In other words, 
the heuristic gives preference to merges, whose evidence is supported by 
high compatibility with other merges. Incompatible merges can be triv- 
ially detected during the computation of the heuristic. We also propose 
a new heuristic that, after a backtrack, limits the set of candidates to 
these incompatible merges, thereby introducing diversity into the search. 



1 Introduction 

Most real world phenomena can be represented as syntactically structured se- 
quences. Examples of such sequences include DNA, natural language sentences, 
electrocardiograms, speech signals, chain codes, etc. Grammatical inference ad- 
dresses the problem of extracting/learning finite descriptions/representations, 
from examples of these syntactically structured sequences. Deterministic finite 
state automata (DFA) are an example of a finite representation, used to learn 
these sequences. Section 2 presents to the reader some preliminary definitions. 
Section 3 describes the current leading DFA-learning algorithm EDSM, whereas 
sections 4 and 5 introduce a novel heuristic for EDSM, namely S-EDSM, and 
the backtracking heuristic. Finally, sections 6 and 7 document the initial results 
of this new heuristic. 

2 Preliminary Definitions 

This section introduces the reader to the terms and definitions used 
throughout this paper. It is assumed that the reader is already familiar with 
the definitions and results in set theory and formal languages, as well as the 
area of DFA learning, in particular state merging DFA learning algorithms. 

2.1 Transition Trees 

Transition trees represent the set of string suffixes of a language L from a par- 
ticular state. Transition trees are mapped onto each other by taking the state 
partitions of the two transition trees and joining them into new blocks to form a 
new state set partition. The mapping operation is recursively defined as follows: 

Definition 1 (Map) A transition tree $t_1$ is mapped onto transition tree $t_2$ 
by recursively joining together blocks of the set partitions of $t_1$ and 
$t_2$, for each common string prefix present in $t_1$ and $t_2$. The result of 
this mapping operation is a set partition $\pi$, consisting of a number of 
blocks $b$. Each block in $\pi$ is a set of states which have been merged 
together. 

2.2 State Compatibility and Merges 

States in a hypothesis DFA are either unlabeled or labeled as accepting or 
rejecting. Two state labels A, B are compatible in all cases except when A is 
accepting and B is rejecting, or A is rejecting and B is accepting. Two states 
are state compatible if they have compatible labels. The set of all possible 
merges is divided between the set of valid merges, $\mathcal{M}_V$, and that of 
invalid merges, $\mathcal{M}_X$. A valid merge is defined in terms of the 
transition trees mapping operation as follows: 

Definition 2 (Valid Merge) A valid merge $M_V$ in a hypothesis DFA $H$ is 
defined as $(q, q')$, where $q$ and $q'$ are the states being merged, such that 
the mapping of $q'$ onto $q$ results in a state partition $\pi$ of $H$, with a 
number of blocks $b$, such that for each block $b \in \pi$, all states in $b$ 
are state compatible with each other. 
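
Definitions 1 and 2 can be made concrete with a small sketch. The representation below is an assumption made for illustration (a hypothesis is a dictionary of transitions plus a dictionary of state labels, and the function names are hypothetical); it is not taken from the authors' implementation. Joining two states forces their successors on every common symbol to be joined as well, and the merge is valid if no resulting block mixes accepting and rejecting states.

```python
def merged_blocks(trans, q, q_prime):
    """Map q_prime onto q and return the state partition as a list of sets.

    trans maps a state to a dict {symbol: successor state}.
    """
    states = set(trans) | {t for d in trans.values() for t in d.values()}
    states |= {q, q_prime}
    parent = {s: s for s in states}

    def find(s):
        # Follow parent links to the representative of the block containing s.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Outgoing transitions of each block, stored under its representative.
    succ = {s: dict(trans.get(s, {})) for s in states}
    work = [(q, q_prime)]
    while work:
        a, b = work.pop()
        ra, rb = find(a), find(b)
        if ra == rb:
            continue
        parent[rb] = ra
        for sym, target in succ.pop(rb).items():
            if sym in succ[ra]:
                # Common prefix: the two successors must share a block too.
                work.append((succ[ra][sym], target))
            else:
                succ[ra][sym] = target
    blocks = {}
    for s in states:
        blocks.setdefault(find(s), set()).add(s)
    return list(blocks.values())


def is_valid_merge(trans, labels, q, q_prime):
    """A merge is valid if every block contains at most one kind of label."""
    for block in merged_blocks(trans, q, q_prime):
        kinds = {labels[s] for s in block if labels.get(s) is not None}
        if len(kinds) > 1:
            return False
    return True
```

For a prefix tree with transitions `{0: {"a": 1, "b": 2}, 1: {"a": 3}, 2: {"a": 4}}`, state 3 accepting and state 4 rejecting, merging states 1 and 2 is invalid because it forces 3 and 4 into one block, while merging 0 and 1 is valid.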



3 State Merging Algorithms 

The first state merging algorithm is due to Trakhtenbrot and Barzdin [1]. In 
their algorithm all the states of the APTA are labeled, hence the algorithm 
does not make any labeling decisions. We then see Gold’s algorithm, in which 
the algorithm determines the label of unlabeled states in the APTA. Clearly, 
the order in which merges occur, determines the effectiveness of the learning 
process. In this algorithm only compatibility is considered, and evidence is not 
taken into account. EDSM improves on this algorithm by ordering merges on 
the amount of evidence of each individual merge. [2] describes the search space 
of the regular inference. 

3.1 EDSM 

The Evidence Driven State Merging (EDSM) algorithm developed by Price [3] 
emerged from the Abbadingo One DFA learning competition organised by Lang 
and Pearlmutter [3] in 1998. EDSM searches for a target DFA within a lattice 
of hypotheses (automata) enclosed between the augmented prefix tree acceptor 
(APTA) and the Universal Acceptor Automaton (UA) [2]. EDSM only considers 
DFAs that are consistent with the training examples. It is assumed that the tar- 
get DFA lies in the search space of EDSM. It therefore follows, that at least one 
sequence of merges exists that will lead to the target DFA. The algorithm starts 
by constructing an augmented prefix tree acceptor (APTA) and then progres- 
sively performing merges. The search is guided by an evidence heuristic which 
determines which pair of states are to be merged [3]. The heuristic is based upon 
a score, that computes the number of compatible states found in the transition 
trees of the two states under consideration. At each iteration, all possible pairs 
of states in the current hypothesis are evaluated¹. The pair with the highest 
evidence score is chosen for the next merge. This procedure is repeated until no 
more states can be merged. 

3.2 Problems with EDSM 

Although EDSM was one of the winners of Abbadingo One, it still could not 
solve the hardest four problems. These problems were characterised by very 
sparse training sets. If EDSM is to find the target DFA (or some other DFA 
that is close to it) it must, at each iteration of the algorithm, make a ‘correct’ 
merge. The scoring function is therefore critical in determining the direction to 
be taken within the set of possible merges. Since EDSM is a greedy depth-first 
search it is very sensitive to mistakes made in early merges. The algorithm does 
not backtrack to undo a ‘bad’ merge. In general, it is not possible to determine 
when a ‘bad’ merge has been made. Very often, EDSM converges to a DFA that 
is much larger than the target DFA. This is evidently because EDSM 
makes some ‘bad’ merges in the beginning. 

4 Shared Evidence 

In order to improve on what EDSM does, we need to somehow gather more 
evidence from what is available. We propose that evidence can be augmented by 
gathering and combining together, the information derived from the interactions 
between all these valid merges. This combination of individual merge evidence, 
referred to as shared evidence, results in an improvement on the heuristic score. 
These sets of compatible merges prove empirically to be valuable in two respects. 
Firstly, they minimise the risk of making mistakes with the initial merges, thus 
decreasing the size of the hypothesis DFA from that generated by EDSM. Sec- 
ondly, they prune the search tree (when a search strategy is applied) in such 
a way that equivalent merges are grouped together and need not be traversed 
individually. The strategy can also be seen as a kind of lookahead, computing 
an expectation of the score that you can expect in the next choices. 

Shared evidence driven state merging (S-EDSM) is an attempt at a heuristic 
that tries to minimise the risk of a merge. Undoubtedly there exist no 
risk-free merges; however, as opposed to EDSM, S-EDSM tries to share this risk 
across multiple merges. In this paper we are proposing a subset of a preliminary 
calculus, which specifically deals with how merges interact. A heuristic is then 
developed based on some of the properties derived from this analysis. 

¹ W-EDSM, also presented in [3], takes only a subset of all the possible merges. 



4.1 Pairwise and Mutual Compatible Merges 

The most basic interaction between two merges is compatibility. Merge M is 
said to be pairwise compatible to merge M' if after performing merge M, M' 
remains a valid merge in the hypothesis automaton as changed by M. From this 
point onwards we will refer to merges which are elements of $\mathcal{M}_V$. 
More formally, two merges are pairwise compatible if the following property 
holds: 

Definition 3 (Pairwise Compatible) Let $\pi_1$ and $\pi_2$ be the state 
partitions resulting from the application of the map operator to the two merges 
$M_1$ and $M_2$ on hypothesis $H$. Let $H_1$ and $H_2$ be the hypotheses 
resulting from $\pi_1$, $\pi_2$ respectively. $M_1$ and $M_2$ are pairwise 
compatible if for each state $s \in H$, $s \in H_1$ is state compatible with 
$s \in H_2$. 

Table 1. Score Calculation for Pairwise Compatible States 

Current Merge  EDSM Score  S-EDSM Score         Pairwise Compatible Merges
M1             7           7+2                  {M8}
M2             6           6+5+2+1              {M3, M8, M10}
M3             5           5+6+5+4+2            {M2, M4, M6, M8}
M4             5           5+5+4+2              {M3, M5, M8}
M5             4           4+5+3+2+1            {M4, M7, M8, M10}
M6             4           4+5+2                {M3, M8}
M7             3           3+4+2                {M5, M8}
M8             2           2+7+6+5+5+4+4+3+1+1  {M1, M2, M3, M4, M5, M6, M7, M9, M10}
M9             1           1+2+1                {M8, M10}
M10            1           1+6+4+2+1            {M2, M5, M8, M9}

Consider the set $V \subset \mathcal{M}_V$, consisting of the 10 merges 
{M1 .. M10}. Table 1 lists merges M1 to M10, together with the set of merges 
which are pairwise compatible to each merge. For instance, merges M3 and M8 are 
both pairwise compatible with M2. However, note that this does not necessarily 
imply that M3 is pairwise compatible with M8. Moreover from the definition of 
pairwise compatibility it follows that, if M3 is pairwise compatible with M4 
then M4 is pairwise compatible with M3. This means that the order in which the 
two merges are executed does not change the resulting state partition. 

Let pairwise compatibility between two merges be denoted by the symbol |. 
Pairwise compatibility induces the binary relation 
$| \subseteq \mathcal{M}_V \times \mathcal{M}_V$. Thus, M1 | M2 denotes that M1 
is pairwise compatible with M2. Moreover, M1 | {M2, M3, M4} denotes that M1 is 
pairwise compatible with M2, M3, and M4. Note that | is a symmetric relation: 

    M1 | M2  →  M2 | M1 

Fig. 1. Simple example of two pairwise compatible merges 


Let us now consider the possibility that more than just two merges are 
compatible with each other. This means that for a particular hypothesis $H$, 
there might exist $n$ valid merges with $n > 2$ such that the execution of 
merges $M_1$ to $M_{n-1}$ does not affect the validity of merge $M_n$. In this 
case we need to extend the definition of pairwise compatibility to include more 
than two merges. Mutual compatibility is defined as follows: 

Definition 4 (Mutual Compatibility) Let $n$ be equal to an arbitrary number 
of valid merges. Let $\pi_1, \pi_2, \ldots, \pi_n$ be the state partitions 
created when applying the map operator to the $n$ merges 
$M_1, M_2, \ldots, M_n$ on hypothesis $H$. Let $H_1, H_2, \ldots, H_n$ be the 
hypotheses resulting from $\pi_1, \pi_2, \ldots, \pi_n$ respectively. 
$M_1, \ldots, M_n$ are mutually compatible if, for each state $s \in H$, $s$ is 
state compatible with 
$s \in H_1 \wedge \ldots \wedge s \in H_{n-1} \wedge s \in H_n$. 

5 Merge Heuristic 

Shared Evidence Driven State Merging (S-EDSM) is the algorithm based on 
the definitions presented in the previous section. Only pairwise compatibility is 
used in the heuristic. There are plans, however, to extend the heuristic to also 
incorporate other ideas such as merge mutual compatibility, merge dominance, 
merge coverage and merge intersection. The term shared was used to underline 
the basic notion that, the heuristic works by gathering and combining, thus 
sharing, the information of individual merges. 

5.1 Increasing Evidence 

Consider a scenario where a number of valid merges can be performed. Table 
1 shows ten potential merge candidates, with M1’s heuristic score being the 
highest and M10’s heuristic score the lowest. While EDSM just executes the 
merge with the highest heuristic score (in this case M1), S-EDSM first checks 
for pairwise compatibility between all the merges and creates a list of pairwise 
compatible merges for each merge. Table 1 also shows an example of how these 
merges can be grouped through pairwise compatibility. The second column in 
this table indicates the EDSM score for the merge listed in the first column. 
Hence, the first row in the table is read as M1 | {M8} and the second row as 
M2 | {M3, M8, M10}. Note that the sets are not checked for mutual 
compatibility. When EDSM merges M1 (the merge with the highest evidence score), 
merges M2, M3, M4, M5, M6, M7, M9 and M10 become invalid. This means that for 
the next merge selection procedure only merge M8 is possible, since this is the 
only merge which is pairwise compatible with M1. On the other hand, S-EDSM 
proceeds by first re-ordering the merges according to the pairwise compatibility 
of the merges. Column 3 of table 1 shows how the scores are re-calculated for all 
the merges. For the time being scores are simply added together. For instance, 
in the case of M2, its own score (6) is added to M3’s score (5), M8’s score 
(2), and M10’s score (1). Thus, the score of the pairwise compatible set of M2 
is set to 14. Simply adding the scores might however not be the best way of 
re-calculating evidence. 
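
The third column of Table 1 can be reproduced with a few lines of code. The sketch below simply encodes the EDSM scores and pairwise compatible sets listed in Table 1 and sums them; the variable names are illustrative and the dictionaries are a transcription of the table, not output of the authors' system.

```python
# EDSM scores from the second column of Table 1.
edsm = {"M1": 7, "M2": 6, "M3": 5, "M4": 5, "M5": 4,
        "M6": 4, "M7": 3, "M8": 2, "M9": 1, "M10": 1}

# Pairwise compatible merges from the fourth column of Table 1.
compatible = {
    "M1": {"M8"},
    "M2": {"M3", "M8", "M10"},
    "M3": {"M2", "M4", "M6", "M8"},
    "M4": {"M3", "M5", "M8"},
    "M5": {"M4", "M7", "M8", "M10"},
    "M6": {"M3", "M8"},
    "M7": {"M5", "M8"},
    "M8": {"M1", "M2", "M3", "M4", "M5", "M6", "M7", "M9", "M10"},
    "M9": {"M8", "M10"},
    "M10": {"M2", "M5", "M8", "M9"},
}

# S-EDSM score: a merge's own EDSM score plus those of its compatible merges.
s_edsm = {m: edsm[m] + sum(edsm[c] for c in compatible[m]) for m in edsm}

# The merge with the highest shared evidence is the one executed first.
best = max(s_edsm, key=s_edsm.get)
```

With these values M1 scores 7+2 = 9 and M8 scores 38, the highest of all ten candidates, so the re-ordering places M8 ahead of M1 even though M8 has one of the lowest individual EDSM scores.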

M8 is executed first by S-EDSM, since M8 is supported by evidence from 
all the other valid merges. M4 is the second merge to be executed, followed by 
M3. Once M3 is done, no more merges are possible. The merge sequence 
created by EDSM consists of two merges, M1 and M8. S-EDSM creates a merge 
sequence of three merges, M8, M4, then M3. The overall EDSM heuristic score 
for this sequence is 9, whereas for S-EDSM it is 12. Thus, it seems that overall 
S-EDSM has performed a sequence of merges which is better than the one cho- 
sen by EDSM. However, one should note that for the time being only pairwise 
compatibility is being considered. Pairwise compatibility alone does not give 
sufficient knowledge of how many states are actually giving distinct evidence. 
By distinct evidence we mean, the evidence that is given uniquely by a state 
through a state compatibility check. With pairwise compatibility sets, the same 
state compatibility check may account (and in practice it usually does) for in- 
creasing the evidence score of the pairwise compatible set when summing the 
individual EDSM scores. Consider for example figure 2. In this simple example, 
we have two possible merges (q3,q2,{q2}) and (q2,q1,{q1}) with EDSM 
evidence scores of one and two respectively. Clearly these two merges are pairwise 
compatible. Note however that when re-calculating the scores for S-EDSM, the 
evidence given by q2 is counted twice when calculating the S-EDSM score for 
the pairwise compatible set of q1. 



5.2 Pairwise Compatibility — A Lookahead Strategy 

Fig. 2. Two simple merges, (q3,q2,{q2}) and (q2,q1,{q1}) 

Recall that EDSM’s main problem is that when only a few states are labeled 
in the APTA, it is very difficult for the heuristic to determine which merge to 
perform. S-EDSM, by using pairwise compatibility for single merges, tries to 
alleviate this problem by making use of a lookahead strategy. Lookahead occurs 
because of the following two factors: 

— By gathering evidence for a single merge M from multiple merges pairwise 
compatible with M, and 

— By increasing the number of labeled states (labeled by M) before recalcu- 
lating the EDSM scores of the pairwise compatible states of M. 

The execution of each valid merge M, before calculating the pairwise 
compatibility set of M, actually accounts for more evidence in terms of state 
labels which can now be used. S-EDSM works by always calculating these sets for 
each step in the search. Consider the DFA illustrated in figure 1. We know that 
for merge M1 = (q0,q2,{q0}) and merge M2 = (q1,q4,{q1}), M1 | M2. M1’s 
EDSM score is equal to zero simply because there are no states which are state 
compatible, which are both accepting or rejecting in the respective transition 
trees of q0 and q2. However, when calculating the pairwise compatible set for 
merge M2, EDSM’s score for merge M1 becomes one, since now q1 has been 
labeled as rejecting by M2. This new evidence supports the execution of M2. 

5.3 Calculating the Evidence Score 

Computing the set of pairwise compatible merges for such a large number of 
merges is not feasible. The strategy adopted by S-EDSM is to include a 
parameter which determines the set $\mathcal{M}_S$ of valid merges which are 
taken into consideration. Two possibilities have been implemented, with the 
first option used for the experiments. 

— Identify the merge with the highest evidence score. Include in 
$\mathcal{M}_S$ all those merges whose EDSM score falls within a given 
percentage of this score (typically 70% of the best score in the experiments). 

— Include in $\mathcal{M}_S$ a percentage of all the valid merges ordered by 
EDSM score in descending order. 
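
The two selection options can be sketched as follows. The function names are hypothetical, and the reading of the first option as "keep every merge whose score is at least 70% of the best score" is an assumption about how the percentage is applied.

```python
import math


def select_by_score_ratio(scores, ratio=0.7):
    """Option 1: keep merges whose EDSM score is at least ratio * best score."""
    best = max(scores.values())
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [m for m in ranked if scores[m] >= ratio * best]


def select_top_fraction(scores, fraction):
    """Option 2: keep the given fraction of merges, best EDSM scores first."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    keep = max(1, math.ceil(fraction * len(ranked)))
    return ranked[:keep]
```

Applied to the EDSM scores of Table 1 (7, 6, 5, 5, 4, 4, 3, 2, 1, 1 for M1 to M10), the first option with a ratio of 0.7 keeps M1, M2, M3 and M4, since the cut-off is 4.9. Note that under this reading a low-scoring merge such as M8 would not be considered, so the choice of parameter directly affects which shared evidence can be found.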



5.4 Using Incompatible Merges in Backtrack 

How to backtrack efficiently in the state merging framework is still an open 
question when the search space is too big for complete algorithms. Since the 
Abbadingo competition, it has been accepted that the first choices of EDSM are 
less informed when the learning data become sparse, and thus that the earliest 
merges are the most critical ones. After the winning but CPU-demanding approach 
of Juille [4], only a few proposals have been made, focusing essentially on 
these first choices by a wrapper technique [5] or even by choosing the first 
merge randomly [6]. We believe that these methods have been only moderately 
fruitful because they fail to escape from the neighbourhood of the first 
solution, always visiting the same area of the search space. 

Mutually Compatible and Incompatible Merges 

Algorithm 1 Merge Score Calculation for S-EDSM 
Require: A hypothesis $H$ 
Require: A set $V \subset \mathcal{M}_V$ on $H$ 
for j = 1 to size(V) do 
    enableTrackingOfMergeChangesOnStack 
    H ← executeMerge(V_j) 
    for k = 1 to size(V) do 
        if V_k is still a valid merge then 
            V_k is pairwise compatible to V_j 
            include V_k in set of pairwise compatible merges of V_j 
            score V_j = score V_j + score V_k 
        else 
            V_k is not pairwise compatible to V_j 
            score V_j = score V_j - score V_k 
        end if 
    end for 
    restoreMergeChangesFromStack 
end for 
H ← executeMerge(highestScore(V)) 
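
The score calculation of Algorithm 1 can be sketched in executable form under some simplifying assumptions. Here the step "execute the merge, test validity, restore" is abstracted into a caller-supplied predicate `still_valid_after(vj, vk)`, a merge is not compared with itself, and the original EDSM scores are used throughout rather than scores updated in place. All names are hypothetical.

```python
def s_edsm_scores(candidates, edsm_score, still_valid_after):
    """Return (scores, compatible, incompatible) for a list of candidate merges.

    still_valid_after(vj, vk) must return True if merge vk is still valid
    once merge vj has been executed on the current hypothesis.
    """
    scores, compatible, incompatible = {}, {}, {}
    for vj in candidates:
        total = edsm_score[vj]
        compatible[vj], incompatible[vj] = [], []
        for vk in candidates:
            if vk == vj:
                continue
            if still_valid_after(vj, vk):
                # vk is pairwise compatible with vj: its evidence is shared.
                compatible[vj].append(vk)
                total += edsm_score[vk]
            else:
                # vk is ruled out by vj: its evidence counts against vj.
                incompatible[vj].append(vk)
                total -= edsm_score[vk]
        scores[vj] = total
    return scores, compatible, incompatible


# Toy example with three merges: A and C are the only incompatible pair.
edsm_score = {"A": 3, "B": 2, "C": 1}
clashes = {frozenset(("A", "C"))}
scores, compatible, incompatible = s_edsm_scores(
    ["A", "B", "C"], edsm_score,
    lambda vj, vk: frozenset((vj, vk)) not in clashes)
best = max(scores, key=scores.get)
```

In the toy example merge B obtains the highest score (2+3+1 = 6) although A has the highest individual EDSM score, because A loses the evidence of the merge it would invalidate (3+2-1 = 4).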

We propose here to introduce diversity in the exploration of the search space 
by limiting the choice of candidate merges after a backtrack to the set of 
merges incompatible with the undone merge. The set of merges incompatible with 
the merge with the highest score can be easily memorised with a small 
modification of algorithm 1: when V_k is not pairwise compatible to V_j, V_k 
can be added to a set of incompatible merges of V_j. (Only two sets of 
incompatible merges are needed: the current one and the one of the best 
candidate, denoted hereafter $I$.) 

A simple implementation of the backtrack scheme may then consist in replacing 
the last line of algorithm 1 by: 

    enableTrackingOfMergeChangesOnStack 
    H ← executeMerge(highestScore(V)) 
    restoreMergeChangesFromStack 
    H ← executeMerge(highestScore(V ∩ I)) 

More subtle implementations around this scheme can be developed by choos- 
ing to propagate the limitation of the candidates to the next choices, but these 
have to be chosen carefully according to the search strategy and the heuristic. 
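
The restriction of candidates after a backtrack can be illustrated with a minimal sketch. It assumes that the scores and the set of incompatible merges of each candidate have already been computed; the function name and the data are hypothetical.

```python
def first_and_backtrack_choice(scores, incompatible):
    """Return the best merge and the merge tried after undoing it.

    After a backtrack, only merges incompatible with the undone merge are
    candidates; None is returned if there is no such merge.
    """
    best = max(scores, key=scores.get)
    restricted = [m for m in scores if m in incompatible[best]]
    alternative = max(restricted, key=scores.get) if restricted else None
    return best, alternative


scores = {"A": 5, "B": 3, "C": 4}
incompatible = {"A": {"B"}, "B": {"A"}, "C": set()}
best, alternative = first_and_backtrack_choice(scores, incompatible)
```

Here the alternative is B rather than the higher-scoring C: C is compatible with A and would probably lead back to the same region of the search space, whereas B can only be reached if A is not performed.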

Fig. 3. Ten 32 state Gowachin Problems - Average Classification Rate 



We show here only some preliminary results, without propagation, on small 
target automata. In these experiments, the backtrack has been limited to the 
first three merges. Figure 3 shows that the incompatible merges backtrack 
scheme (I-BT) outperforms the reference algorithm with the same S-EDSM score 
calculation (BT) on sparser training samples. 

6 S-EDSM Classification Rate 

Classification of testing data is the best indicator of how well a learning 
algorithm performs. A comparison of classification rates between EDSM and 
S-EDSM is not a straightforward task. This is because there are numerous 
cases, when sparse training sets are used, where both algorithms perform 
poorly. Suppose EDSM achieves a classification rate of 0.51 and S-EDSM achieves 
a classification rate of 0.55. In these situations, there is really no 
difference between the two hypotheses. For this reason, classification rate 
experiments were carried out as follows. For a given problem, the target DFA 
was inferred by using a large training set. Portions of this training set were 
then systematically removed and the two learning algorithms were applied to the 
reduced training sets. With this method we can check which algorithm requires 
the least amount of training strings in order to infer a good DFA. Since the 
target DFA is not known, training sets with 30,000 strings are initially used 
to exactly infer the target DFAs. The graphs of figure 4 show the average 
classification rates for ten DFAs with target sizes of 256, 192 and 128 states 
respectively. The three sets of ten problems each have been sequentially 
downloaded from Gowachin. 

With 128 states, the main difference between S-EDSM and EDSM occurs
when the training set is reduced below 6,000 strings. The average classification
rate for EDSM degenerates to 1,581 correctly classified strings (i.e. 87%), while
S-EDSM maintains (although still decreasing) an average classification rate of
1,737 (i.e. 96%). This indicates that S-EDSM is somewhat less sensitive to
sparse training sets. For the 256 state problems, S-EDSM starts to classify
the testing set better with 10,000 strings. With 9,000 strings in the training set, the
distance in classification rate between EDSM and S-EDSM increases: S-EDSM
on average classifies 1,736 strings correctly (96%), while EDSM classifies 1,671
(92%). The biggest difference in classification rate occurs when the training set
contains 7,000 strings. The average classification rate for S-EDSM remains high,
while EDSM’s classification rate drops considerably. Finally, for the set of ten
192 state problems, a similar behaviour is observed.

Fig. 4. Ten 128, 192 and 256 state Gowachin Problems - Average Classification Rate
(panel titles: 128 States Classification, 192 States Classification)

Fig. 5. Ten 300, 400 and 500 state Gowachin Problems - Final Hypothesis Size Comparison



7 Final Hypothesis Size 

The final hypothesis size gives an indication of how well a learning algorithm is
searching the lattice of automata compatible with the training set. Moreover, by
Occam’s razor, when both algorithms give a classification rate of 0.5, we
can argue that the size of the final hypothesis constitutes a measure which can
be used to discriminate between the two algorithms. Essentially, the smaller the
number of states used to generalise a finite set of examples, the better. Figure 5
shows thirty consecutively downloaded Gowachin problems with target sizes of
500 (24,000 strings), 400 (20,000) and 300 (12,000) states. Although thirty problems
certainly do not constitute an exhaustive sample, the sample
is enough to demonstrate a trend in the sizes of the final hypotheses.

It is clear from these thirty problems that, when the percentage of
labeled nodes goes under 20%, S-EDSM heavily outperforms EDSM in target size
convergence. S-EDSM’s heuristic seems to be paying off in terms of target size
convergence.

8 Conclusion and Perspectives

Initial results on S-EDSM’s heuristic show that there is an improvement in both
classification rate and final hypothesis size. The reason for this can be attributed
to the fact that S-EDSM augments its evidence score by combining the information
of multiple valid merges. By doing so, S-EDSM avoids very ‘bad’ merges
in the beginning, when the information is sparse. Considering both compatible
merges and incompatible merges also seems promising, and we think that
S-EDSM coupled with a backtrack heuristic based on incompatible merges is
worth studying more systematically on artificial and also real data.



Abstract. Hidden Markov Models (HMMs) are probabilistic models suitable
for a wide range of pattern recognition tasks. In this work, we propose a new
gradient descent method for Conditional Maximum Likelihood (CML) training
of HMMs, which significantly outperforms traditional gradient descent. Instead
of using a fixed learning rate for every adjustable parameter of the HMM, we
propose the use of independent learning rate/step-size adaptation, which has
proved valuable as a strategy in Artificial Neural Network training. We
show here that our approach performs significantly better than standard gradient
descent. The convergence speed is increased up to five times, while
at the same time the training procedure becomes more robust, as tested on
applications from molecular biology. This is accomplished without additional
computational complexity or the need for parameter tuning.



1 Introduction 

Hidden Markov Models (HMMs) are probabilistic models suitable for a wide range 
of pattern recognition applications. Initially developed for speech recognition [1], 
during the last few years they became very popular in molecular biology for protein 
modeling [2,3] and gene finding [4,5]. 

Traditionally, the parameters of an HMM (emission and transition probabilities)
are optimized according to the Maximum Likelihood (ML) criterion. A widely used
algorithm for this task is the efficient Baum-Welch algorithm [6], which is in fact an
Expectation-Maximization (EM) algorithm [7], guaranteed to converge to at least a
local maximum of the likelihood. Baldi and Chauvin later proposed a gradient descent
method capable of the same task, which offers a number of advantages over the
Baum-Welch algorithm, including smoothness and on-line training abilities [8].

When training an HMM using labeled sequences [9], we can either choose to train 
the model according to the ML criterion, or to perform Conditional Maximum Likeli- 
hood (CML) training which is shown to perform better in several applications [10]. 
ML training could be performed (after some trivial modifications) with the use of 
standard techniques such as the Baum-Welch algorithm or gradient descent, whereas 
for CML training one should rely solely on gradient descent methods. 

The main advantage of the Baum-Welch algorithm (and hence of ML training) is
its simplicity and the fact that it requires no parameter tuning. Furthermore,
compared to standard gradient descent, even for ML training, the Baum-Welch algorithm
achieves significantly faster convergence rates [11]. On the other hand, gradient
descent (especially in the case of large models) requires a careful search in the parameter
space for an appropriate learning rate in order to achieve the best possible performance.

In the present work, we extend the gradient descent approach for CML training of
HMMs. Adopting ideas from the literature on training techniques for feed-forward
multilayer perceptrons trained with back-propagation, we introduce a new
scheme for gradient descent optimization of HMMs. We propose the use of independent
learning rate/step-size adaptation for every trainable parameter of the HMM
(emission and transition probabilities), and we show that it not only significantly
outperforms standard gradient descent in convergence rate, but also leads to a
much more robust training procedure, while remaining equally simple,
since it requires almost no parameter tuning.

In the following sections we will first establish the appropriate notation for describing
a Hidden Markov Model with labeled sequences, following mainly the notation
used in [2] and [12]. We will then briefly describe the algorithms for parameter
estimation with ML and CML training, and afterwards we will introduce our proposal
of individual learning rate adaptation as a faster and simpler alternative to standard
gradient descent. Finally, we will show the superiority of our approach on a real-life
application from computational molecular biology, training a model for the prediction
of the transmembrane segments of β-barrel outer membrane proteins.



2 Hidden Markov Models 

A Hidden Markov Model is composed of a set of (hidden) states, a set of observable
symbols and a set of transition and emission probabilities. Two states k, l are connected
by means of the transition probabilities $a_{kl}$, forming a 1st order Markovian
process. Assuming a protein sequence x of length L denoted as:

$$x = x_1, x_2, \ldots, x_L \tag{1}$$

where the $x_i$'s are the 20 amino acids, we usually denote the “path” (i.e. the sequence
of states) ending up to a particular position of the amino acid sequence (the sequence
of symbols) by $\pi$. Each state k is associated with an emission probability $e_k(x_i)$,
which is the probability of a particular symbol $x_i$ being emitted by that state. When
using labeled sequences, each amino acid sequence x is accompanied by a sequence
of labels y, one for each position i in the sequence:

$$y = y_1, y_2, \ldots, y_L \tag{2}$$

Consequently, one has to declare a new probability distribution, in addition to the
transition and emission probabilities: the probability $\delta_k(c)$ of a state k having a label
c. In almost all biological applications this probability is just a delta-function, since a
particular state is not allowed to match more than one label. The total probability of a
sequence x given a model is calculated by summing over all possible paths:

$$P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta) = \sum_{\pi} a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \tag{3}$$

This quantity is calculated using a dynamic programming algorithm known as the
forward algorithm, or alternatively by the similar backward algorithm [1]. In [9],
Krogh proposed a simple modified version of the forward and backward algorithms,
incorporating the concept of labeled data. Thus we can also use the joint probability
of the sequence x and the labeling y given the model:

$$P(x, y \mid \theta) = \sum_{\pi} P(x, y, \pi \mid \theta) = \sum_{\pi \in \Pi_y} P(x, \pi \mid \theta) = \sum_{\pi \in \Pi_y} a_{0\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \tag{4}$$

The idea behind this approach is that summation has to be done only over those
paths $\Pi_y$ that are in agreement with the labels y. If multiple sequences are available
for training (which is usually the case), they are assumed independent, and the total
likelihood of the model is just a product of probabilities of the form (3) and (4) for
each of the sequences. The generalization of Equations (3) and (4) from one to many
sequences is therefore trivial, and we will consider only one training sequence x in
the following.
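
As an illustration of Equations (3) and (4), the following Python sketch computes both
probabilities with the forward recursion. It is a minimal sketch, not the implementation
used in this work: the function name `forward`, the argument layout (`a0`, `a`, `e`,
`allowed`) and the toy two-state model are assumptions introduced here. No explicit end
state is modelled, and the labels are assumed to be delta-functions, so that each label
simply restricts the set of admissible states at a position.

```python
def forward(x, a0, a, e, allowed=None):
    """Return P(x | theta), or P(x, y | theta) when `allowed` is given.

    x       : list of symbol indices, length L
    a0[k]   : probability of starting in state k
    a[k][l] : transition probability from state k to state l
    e[k][b] : probability that state k emits symbol b
    allowed : optional list; allowed[i] is the set of states whose label
              agrees with y_i (hypothetical encoding of the labeling y).
    """
    n = len(a0)

    def ok(i, k):
        return allowed is None or k in allowed[i]

    # Initialisation: start in state k and emit x_1.
    f = [a0[k] * e[k][x[0]] if ok(0, k) else 0.0 for k in range(n)]
    # Recursion: sum over predecessor states, then emit x_i.
    for i in range(1, len(x)):
        f = [
            e[l][x[i]] * sum(f[k] * a[k][l] for k in range(n)) if ok(i, l) else 0.0
            for l in range(n)
        ]
    return sum(f)


# Toy model (assumed values): two states, two symbols.
a0 = [0.5, 0.5]
a = [[0.9, 0.1], [0.2, 0.8]]
e = [[0.7, 0.3], [0.1, 0.9]]
x = [0, 1]
allowed = [{0}, {1}]  # labeling y forces state 0 at position 1, state 1 at position 2

p_x = forward(x, a0, a, e)            # free-running quantity, Equation (3)
p_xy = forward(x, a0, a, e, allowed)  # clamped quantity, Equation (4)
print(p_x, p_xy, p_xy / p_x)          # the last value is P(y | x, theta)
```

Restricting the recursion to the admissible states is what makes the clamped quantity no
more expensive to compute than the free-running one.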

2.1 Maximum Likelihood and Conditional Maximum Likelihood 

The Maximum Likelihood (ML) estimate for any arbitrary model parameter is denoted by:

$$\theta^{ML} = \arg\max_{\theta} P(x \mid \theta)$$

The dominant algorithm for ML training is the elegant Baum-Welch algorithm [6].
It is a special case of the Expectation-Maximization (EM) algorithm [7], proposed for
Maximum Likelihood (ML) estimation from incomplete data. The algorithm iteratively
updates the model parameters (emission and transition probabilities) using their
expectations, computed with the forward and backward algorithms. Convergence
to at least a local maximum of the likelihood is guaranteed, and since it
requires no initial parameters, the algorithm needs no parameter tuning. It has been
shown that maximizing the likelihood with the Baum-Welch algorithm can be done
equivalently with a gradient descent method [8]. It should be mentioned here, as is
apparent from the above equations, where the summation is performed over the entire
training set, that we consider only the batch (off-line) mode of training. Gradient descent
could also be performed in on-line mode, but we will not consider this option in this
work, since the collection of heuristics we present is especially developed for the batch
mode of training.

In the CML approach (which is usually referred to as discriminative training) the
goal is to maximize the probability of correct labeling, instead of the probability of
the sequences [9, 10, 12]. This is formulated as:

$$\theta^{CML} = \arg\max_{\theta} P(y \mid x, \theta) = \arg\max_{\theta} \frac{P(x, y \mid \theta)}{P(x \mid \theta)}$$

When turning to negative log-likelihoods this is equivalent to minimizing the
difference between the logarithms of the quantities in Equations (4) and (3). Thus the
log-likelihood can be expressed as the difference between the log-likelihood in the
clamped phase and that of the free-running phase [9]:

$$\ell = -\log P(y \mid x, \theta) = \ell_c - \ell_f, \qquad \ell_c = -\log P(x, y \mid \theta), \quad \ell_f = -\log P(x \mid \theta)$$



The maximization procedure cannot be performed with the Baum-Welch algorithm
[9, 10] and a gradient descent method is more appropriate. The gradients of the
(negative) log-likelihood $\ell$ w.r.t. the transition and emission probabilities, according to [12], are:

$$\frac{\partial \ell}{\partial a_{kl}} = -\frac{A^c_{kl} - A^f_{kl}}{a_{kl}}, \qquad \frac{\partial \ell}{\partial e_k(b)} = -\frac{E^c_k(b) - E^f_k(b)}{e_k(b)}$$

The superscripts c and f in the above expectations correspond to the clamped and
free-running phases discussed earlier. The expectations A, E are computed as described
in [12], using the forward and backward algorithms [1,2].

2.2 Gradient Descent Optimization 

By calculating the derivatives of the log-likelihood with respect to a generic parameter
$\theta$ of the model, we proceed with gradient descent and iteratively update these
parameters according to:

$$\theta^{(t+1)} = \theta^{(t)} - \eta \frac{\partial \ell^{(t)}}{\partial \theta}$$

where $\eta$ is the learning rate. Since the model parameters are probabilities, performing
gradient descent optimization would most probably lead to negative estimates [8]. To
avoid the risk of obtaining negative estimates, we have to use a proper parameter
transformation, namely the normalization of the estimates in the range [0,1], and perform
gradient descent optimization on the new variables [8,12]. For example, for the
transition probabilities, we obtain:

$$a_{kl} = \frac{\exp(z_{kl})}{\sum_{l'} \exp(z_{kl'})} \tag{13}$$

Now, doing gradient descent on the z’s,

$$z_{kl}^{(t+1)} = z_{kl}^{(t)} - \eta \frac{\partial \ell^{(t)}}{\partial z_{kl}} \tag{14}$$

yields the following update formula for the transitions:

$$a_{kl}^{(t+1)} = \frac{a_{kl}^{(t)} \exp\left(-\eta \frac{\partial \ell^{(t)}}{\partial z_{kl}}\right)}{\sum_{l'} a_{kl'}^{(t)} \exp\left(-\eta \frac{\partial \ell^{(t)}}{\partial z_{kl'}}\right)} \tag{15}$$



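The gradients with respect to the new variables $z_{kl}$ can be expressed entirely in
terms of the expected counts and the transition probabilities at the previous iteration.
Similar results can be obtained for the emission probabilities. Thus, when we train
the model according to the CML criterion, the derivative of the log-likelihood w.r.t. the
transformed variable of a transition probability is:

$$\frac{\partial \ell}{\partial z_{kl}} = -\left[ A^c_{kl} - A^f_{kl} - a_{kl} \sum_{l'} \left( A^c_{kl'} - A^f_{kl'} \right) \right] \tag{16}$$

Substituting now Equation (16) into Equation (15), we get an expression entirely
in terms of the model parameters and their expectations:

$$a_{kl}^{(t+1)} = \frac{a_{kl}^{(t)} \exp\left(\eta \left[ A^c_{kl} - A^f_{kl} - a_{kl}^{(t)} \sum_{l'} \left( A^c_{kl'} - A^f_{kl'} \right) \right]\right)}{\sum_{l'} a_{kl'}^{(t)} \exp\left(\eta \left[ A^c_{kl'} - A^f_{kl'} - a_{kl'}^{(t)} \sum_{l''} \left( A^c_{kl''} - A^f_{kl''} \right) \right]\right)} \tag{17}$$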



The last equation describes the update formula for the transition probabilities
according to CML training with standard gradient descent [12]. The main disadvantage
of gradient descent optimization is that it can be very slow [11]. In the following
section we introduce our proposal for a faster version of gradient descent optimization,
using information included only in the first derivative of the likelihood function.
These kinds of techniques have proved very successful in speeding up the convergence
rate of back-propagation in multi-layer perceptrons, and at the same time
they also improve the stability of training. However, even though they are a
natural extension to the gradient descent optimization of HMMs, to our knowledge no
such effort has been made in the past.
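
The reparameterised update of Equations (13) to (16) can be sketched in Python for a single
row k of the transition matrix. This is an illustrative sketch under stated assumptions,
not the code used in this work: the function names (`softmax_row`, `update_row`,
`cml_grad_z`) and the numerical values are hypothetical, and the expected counts in the
clamped and free-running phases are taken as given inputs rather than computed with the
forward and backward algorithms.

```python
import math


def softmax_row(z):
    """Equation (13): map unconstrained variables z_kl to probabilities a_kl."""
    m = max(z)  # subtract the maximum for numerical stability
    w = [math.exp(v - m) for v in z]
    s = sum(w)
    return [v / s for v in w]


def update_row(a_row, grad_z, eta):
    """Equation (15): multiplicative update of one row, then renormalisation."""
    w = [a * math.exp(-eta * g) for a, g in zip(a_row, grad_z)]
    s = sum(w)
    return [v / s for v in w]


def cml_grad_z(a_row, counts_clamped, counts_free):
    """Equation (16): derivative of the negative log-likelihood w.r.t. z_kl."""
    d = [c - f for c, f in zip(counts_clamped, counts_free)]
    total = sum(d)
    return [-(d_l - a_l * total) for d_l, a_l in zip(d, a_row)]


# Assumed toy values for one row of the transition matrix.
z = [0.2, -1.0, 0.5]
a_row = softmax_row(z)
grad = cml_grad_z(a_row, [3.0, 1.0, 2.0], [2.5, 1.5, 2.2])
eta = 0.05

new_a = update_row(a_row, grad, eta)
# Stepping in z and re-applying the softmax gives the same probabilities.
same_a = softmax_row([zi - eta * gi for zi, gi in zip(z, grad)])
print(new_a, same_a)
```

Working on the probabilities directly, as in `update_row`, avoids storing the z variables
at all, which is why the update can be written purely in terms of the previous estimates.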

3 Individual Learning Rate Adaptation 

One of the greatest problems in training large models (HMMs or ANNs) with gradient
descent is finding an optimal learning rate [13]. A small learning rate will slow
down the convergence speed. On the other hand, a large learning rate will probably
cause oscillations during training, finally leading to divergence, so that no useful model
is trained. A few ways of escaping this problem have been proposed in the
machine learning literature [14]. One option is to use some kind of
adaptation rule for adapting the learning rate during training. This could be done, for
instance, by starting with a large learning rate and decreasing it by a small amount at each
iteration, while forcing it to be the same for every model parameter, or alternatively
by adapting it individually for every parameter of the model, relying on information
included in the first derivative of the likelihood function [14]. Another approach is to
turn to second-order methods, using information from the second derivative. Here we
consider only methods relying on the first derivative, and we developed two algorithms
that use individual learning rate adaptation, presented below.

The first, denoted as Algorithm 1, alters the learning rate according to the sign of
the last two partial derivatives of the likelihood w.r.t. a specific model parameter.
Since we are working with transformed variables, the partial derivative which we
consider is that of the new variable. For example, for the transition probabilities
we use the partial derivative of the likelihood w.r.t. $z_{kl}$, and not w.r.t. the
original $a_{kl}$. If the partial derivative has the same sign for two consecutive
steps, the learning rate is increased (multiplied by a factor $a^{+} > 1$), whereas if the
derivative changes sign, the learning rate is decreased (multiplied by a factor $a^{-} < 1$).
In the second case, we set the partial derivative equal to zero and thus prevent an
update of the model parameter. This ensures that in the next iteration the parameter is
modified according to the reduced learning rate, using the actual gradient. We
chose to have the learning rates bounded by minimum and maximum values
denoted by the parameters $\eta_{min}$ and $\eta_{max}$. In the following section Algorithm 1 is
presented for updating the transition probabilities. It is completely straightforward to
derive the appropriate expressions for the emission probabilities as well. In the
following, the sign operator returns 1 if its argument is positive, -1 if it
is negative and 0 otherwise, whereas the min and max operators are the usual minimum
and maximum of two arguments.
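
The rule described above for Algorithm 1 can be sketched in Python as follows. This is a
reading of the prose description only, under stated assumptions: the function name
`algorithm1_step`, the nested-list layout of the matrices and the default values (taken
from the settings reported in the results section) are introduced here for illustration,
and the row is renormalised only after all of its learning rates have been adapted.

```python
import math


def algorithm1_step(a, grad, prev_grad, eta, a_plus=1.2, a_minus=0.5,
                    eta_min=1e-20, eta_max=10.0):
    """One iteration of individual learning rate adaptation (sketch).

    a         : transition matrix, a[k][l]
    grad      : current derivatives of the negative log-likelihood w.r.t. z_kl
    prev_grad : derivatives stored at the previous iteration
    eta       : individual learning rates, adapted in place
    Entries of `grad` whose sign changed are set to zero in place, so the
    caller should keep `grad` as `prev_grad` for the next iteration.
    """
    new_a = []
    for k in range(len(a)):
        for l in range(len(a[k])):
            s = prev_grad[k][l] * grad[k][l]
            if s > 0:      # same sign twice: speed up
                eta[k][l] = min(eta[k][l] * a_plus, eta_max)
            elif s < 0:    # sign change: slow down and skip this update
                eta[k][l] = max(eta[k][l] * a_minus, eta_min)
                grad[k][l] = 0.0
        # Multiplicative update with one learning rate per parameter.
        w = [a[k][l] * math.exp(-eta[k][l] * grad[k][l]) for l in range(len(a[k]))]
        total = sum(w)
        new_a.append([v / total for v in w])
    return new_a


# Assumed toy values: one state with two outgoing transitions.
a = [[0.5, 0.5]]
eta = [[0.1, 0.1]]
prev_grad = [[1.0, 1.0]]
grad = [[1.0, -1.0]]
a = algorithm1_step(a, grad, prev_grad, eta)
prev_grad = grad  # store for the next iteration
print(a, eta)
```

In the toy call, the first derivative keeps its sign, so its learning rate grows; the
second changes sign, so its learning rate is halved and its update is suppressed.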

The second algorithm, denoted Algorithm 2, constitutes a more radical approach
and is based on a modified version of the RPROP algorithm [15]. The RPROP algorithm
is perhaps the fastest first-order learning algorithm for multi-layer perceptrons,
and it is designed specifically to eliminate the harmful influence of the size of the
partial derivative on the weight step-size [14, 15]. Algorithm 2 is almost identical to
the one discussed above, the only difference being that the step-size (the
amount of change of a model parameter) at each iteration is independent of the magnitude
of the partial derivative. Thus, instead of modifying the learning rate and multiplying
it by the partial derivative, we chose to modify directly an initial step-size for
every model parameter, denoted by $\Delta$, and then use only the sign of the partial derivative
to determine the direction of the change. Algorithm 2 is presented below. Once
again we need the increasing and decreasing factors $a^{+}$ and $a^{-}$ and the minimum and
maximum values for the step-size, now denoted by $\Delta_{min}$ and $\Delta_{max}$ respectively.

It should be mentioned that the computational complexity and memory requirements
of the two proposed algorithms are similar to those of standard gradient descent for CML
training. The algorithms need to store only two additional matrices with dimensions
equal to the total number of model parameters: the matrix with the partial derivatives
at the previous iteration and the matrix containing the individual learning rates
(or step-sizes) for every model parameter. In addition, a few additional operations are
required per iteration compared to standard gradient descent. In HMMs the main
computational bottleneck is the computation of the expected counts, which requires running
the forward and backward algorithms. Thus, the few additional operations and
memory requirements introduced here are practically negligible.



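Algorithm 2

for each k
{
    for each l
    {
        if $\frac{\partial \ell^{(t-1)}}{\partial z_{kl}} \cdot \frac{\partial \ell^{(t)}}{\partial z_{kl}} > 0$, then
        {
            $\Delta_{kl}^{(t)} = \min\left(\Delta_{kl}^{(t-1)} \cdot a^{+}, \Delta_{max}\right)$
        }
        else if $\frac{\partial \ell^{(t-1)}}{\partial z_{kl}} \cdot \frac{\partial \ell^{(t)}}{\partial z_{kl}} < 0$, then
        {
            $\Delta_{kl}^{(t)} = \max\left(\Delta_{kl}^{(t-1)} \cdot a^{-}, \Delta_{min}\right)$
            $\frac{\partial \ell^{(t)}}{\partial z_{kl}} = 0$
        }
        $a_{kl}^{(t+1)} = \dfrac{a_{kl}^{(t)} \exp\left(-\mathrm{sign}\left(\frac{\partial \ell^{(t)}}{\partial z_{kl}}\right) \Delta_{kl}^{(t)}\right)}{\sum_{l'} a_{kl'}^{(t)} \exp\left(-\mathrm{sign}\left(\frac{\partial \ell^{(t)}}{\partial z_{kl'}}\right) \Delta_{kl'}^{(t)}\right)}$
    }
}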

4 Results and Discussion 

In this section we present results comparing the convergence speed of our algorithms
against standard gradient descent. We apply our proposed algorithms to a real
problem from molecular biology, training a model to predict the transmembrane regions
of β-barrel membrane proteins [16]. These proteins are localized in the outer
membrane of gram-negative bacteria, and their transmembrane regions are formed
by antiparallel, amphipathic β-strands, as opposed to the α-helical membrane proteins,
found in the bacterial inner membrane and in the cell membrane of eukaryotes,
which have their membrane-spanning regions formed by hydrophobic α-helices [17].

The topology prediction of β-barrel membrane proteins, i.e. predicting precisely
the amino-acid segments that span the lipid bilayer, is one of the hard problems in
current bioinformatics research [16]. The model that we used is cyclic, with 61 states,
some of which share the same emission probabilities (hence named tied states).
The full details of the model are presented in [18]. We have to note that similar
HMMs are found to be the best available predictors for α-helical membrane protein
topology [19], and that this particular method currently performs better for β-barrel
membrane protein topology prediction, significantly outperforming two other Neural
Network-based methods [20]. For training, we used 16 non-homologous outer membrane
proteins with structures known at atomic resolution, deposited in the Protein
Data Bank (PDB) [21]. The sequences x are the amino-acid sequences found in PDB,
whereas the labels y required for the training phase were deduced from the
three-dimensional structures. We use one label for the amino acids occurring in the
membrane-spanning regions (TM), a second for those in the periplasmic space (IN) and a
third for those in the extracellular space (OUT). In the prediction phase, the input is
only the sequence x, and the model predicts the most probable path of states with the
corresponding labeling y, using the Viterbi algorithm [1, 2].

For standard gradient descent we use learning rates ($\eta$) ranging from 0.001 to 0.1
for both emission and transition probabilities, whereas the same values were used for
the initial parameters $\eta^{0}$ (in algorithm 1) and $\Delta^{0}$ (in algorithm 2), for every parameter
of the model. For the two algorithms that we proposed, we additionally used $a^{+} = 1.2$
and $a^{-} = 0.5$ for the increasing and decreasing factors, as originally proposed for the
RPROP algorithm, even though the algorithms are not sensitive to these parameters.
Finally, for the minimum and maximum allowed learning rates we used $\eta_{min}$
(algorithm 1) and $\Delta_{min}$ (algorithm 2) equal to $10^{-20}$, and $\eta_{max}$ (algorithm 1) and $\Delta_{max}$
(algorithm 2) equal to 10.

The results are summarized in Table 1. It is obvious that both of our algorithms
perform significantly better than standard gradient descent. The training procedure
with the two newly proposed algorithms is more robust, since even when choosing a very
small or a very large initial value for the learning rate, the algorithm eventually converges
to the same value of negative log-likelihood. This is not the case for standard
gradient descent, since a small learning rate ($\eta = 0.001$) will cause the algorithm to
converge extremely slowly (negative log-likelihood equal to 391.5 at 250 iterations)
or even get trapped in local maxima of the likelihood, while at the same time a large
value will cause the algorithm to diverge (for $\eta > 0.03$). In real-life applications, one
has to conduct an extensive search in the parameter space in order to find the optimal
problem-specific learning rate. It is interesting to note that, no matter which initial values
of the learning rates we used, after 50 iterations our two algorithms converge to
approximately the same negative log-likelihood, which is in any case better than
that obtained by standard gradient descent. Furthermore, we should mention that
Algorithm 1 diverged only for $\eta_{kl} = 0.1$, whereas Algorithm 2 did not diverge in the
range of the initial parameters we used.

Table 1. Evolution of negative log-likelihoods for algorithm 1, algorithm 2 and standard
gradient descent, using different initial values for the learning rate. #: negative log-likelihood
greater than 10000, meaning that the algorithm diverged

On the other hand, the two newly proposed algorithms are much faster than standard
gradient descent. From Table 1 and Figure 1, we observe that standard gradient descent,
even when an optimal learning rate has been chosen ($\eta = 0.02$), requires as
many as five times the number of iterations in order to reach the appropriate log-likelihood.
We should mention here that in real-life applications we would have
chosen a threshold for the difference in the log-likelihood between two consecutive
iterations (for example 0.01). In such cases the training procedure would have been
stopped much earlier using the two proposed algorithms than with standard gradient
descent.

Fig. 1. Evolution of the negative log-likelihoods obtained using the three algorithms, with the
same initial values for the learning rate (0.02). This learning rate was found to be the optimal
for standard gradient descent. Note that after 50 iterations the lines for the two proposed
algorithms are practically indistinguishable, and also that convergence is achieved much faster,
compared to standard gradient descent

We should note here that the observed differences in the values of negative log-likelihood
also correspond to better predictive performance of the model. With
our two proposed algorithms, the correlation coefficient for the correctly predicted
residues, in a two-state mode (transmembrane vs. non-transmembrane), ranges between
0.848 and 0.851, whereas for standard gradient descent it ranges between 0.819 and
0.846. Similarly, the fraction of the correctly predicted residues in a two-state mode
ranges between 0.899 and 0.901, while standard gradient descent
yields a prediction in the range of 0.871-0.885. In all cases these measures were computed
without counting the cases of divergence, where no useful model could be
trained. Obviously, the two algorithms perform consistently better, irrespective of the
initial values of the parameters.
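
For concreteness, the two per-residue measures can be computed as in the following Python
sketch. It assumes that the correlation coefficient is the Matthews correlation coefficient
commonly used for two-state predictions, which the text does not state explicitly; the
function name `two_state_scores` and the toy labelings are hypothetical.

```python
import math


def two_state_scores(predicted, observed, positive="TM"):
    """Return (fraction correct, correlation coefficient) in two-state mode.

    Every label other than `positive` (here IN and OUT) is treated as
    non-transmembrane.
    """
    tp = tn = fp = fn = 0
    for p, o in zip(predicted, observed):
        if o == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    fraction = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumption: Matthews correlation coefficient; defined as 0 when undefined.
    corr = (tp * tn - fp * fn) / denom if denom else 0.0
    return fraction, corr


# Toy labelings (assumed), one label per residue.
observed = ["TM", "TM", "TM", "TM", "IN", "IN", "OUT", "OUT"]
predicted = ["TM", "TM", "TM", "IN", "TM", "IN", "OUT", "OUT"]
fraction, corr = two_state_scores(predicted, observed)
print(fraction, corr)
```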

If we used different learning rates for the emission and transition probabilities, we
would probably achieve more reliable training for standard gradient descent. Unfortunately,
this would mean optimizing two parameters simultaneously,
which would require more trials to find optimal values for each one.
Our two proposed algorithms, on the other hand, do not depend that much on the initial
values, and thus this problem is not present.



5 Conclusions 

We have presented two simple yet powerful modifications of the standard gradient
descent method for training Hidden Markov Models with the CML criterion. The
approach is based on individual learning rate adaptation, which has proved
useful for speeding up the convergence of multi-layer perceptrons, but to date no
such study has been performed on HMMs. The results obtained from this
study are encouraging; our proposed algorithms not only outperform, as one would
expect, standard gradient descent in terms of training speed, but also provide a
much more robust training procedure. Furthermore, in all cases the predictive performance
is better, as judged from the measures of per-residue accuracy mentioned
earlier.

Faster Gradient Descent Training of Hidden Markov Models

In conclusion, the two algorithms presented here converge much faster
to the same value of the negative log-likelihood, and produce better results. Thus, it is
clear that they are superior to standard gradient descent. Since the required
parameter tuning is minimal, and neither the computational complexity nor the
memory requirements increase, our algorithms constitute a potential replacement for
standard gradient descent for CML training.



Abstract. The aim of this paper is to try to understand the process of
children’s language acquisition by using the theory of inference of formal 
grammars. Toward this goal, we introduce an extension of Marcus Exter- 
nal Contextual grammars which constitutes a Mildly Context-Sensitive 
language family, and study their learnability in the limit from positive 
data. Finally, we briefly indicate our future research direction. 



1 Introduction 

Grammatical Inference is known as one of the most attractive paradigms of
scientific learning and is nowadays a classical, but still active, discipline. It refers
to the process of learning grammars and languages from data. Gold originated
this study trying to construct a formal model of human language acquisition. A
remarkable amount of research has been done since his seminal work to establish
a theory of Grammatical Inference, to find effective and efficient methods for
inferring grammars, and to apply those methods to practical problems (e.g.,
Natural Language Processing, Computational Biology). Grammatical Inference
has been investigated within many research fields, including machine learning,
computational learning theory, pattern recognition, computational linguistics,
neural networks, formal language theory, and many others (see [1]).

Grammatical inference and linguistic studies are closely related, especially
linguistic studies of Chomskyan inspiration. These studies conceive grammar
as a machine (in the sense of the Theory of Formal Languages) that children
develop and reconstruct very fast during the first years of their lives. Children
infer and select the grammar of their language from the data (from the examples
of the language) that the surrounding world offers them, but the facility with
which children acquire language belies the complexity of the task. This idea
accords with the belief in the biologically determined character of the human
linguistic capacity.

* This research was supported by an FPU Fellowship from the Spanish Ministry of
Education and Science.

Hence, the study of grammatical inference is highly interdisciplinary, drawing
from computer science, linguistics and cognitive science. We will try to bring
together the Theory of Grammatical Inference and Studies of Language Acquisition
(which are connected areas, but belong to different scientific traditions).



2 Which Formal Grammars Are Adequate 
to Describe Natural Languages? 

The question of determining the location of natural languages in the Chomsky
hierarchy has been a subject of debate for a long time. Several authors have proven
the non-context-freeness of natural languages by presenting examples of natural
language structures that cannot be described using a context-free grammar.
Next, we will present several such non-context-free structures [2]:

Dutch: The following example shows a duplication-like structure
$\{w\bar{w} \mid w \in \{a,b\}^*\}$, where $\bar{w}$ is the word obtained from $w$ by replacing
each letter with its barred copy.

...dat Jan Piet Marie de Kinderen zag helpen laten zwemmen 
(That Jan saw Piet help Marie make the children swim) 

This is only weakly non-context-free, i.e., only in the deep structure. 

Bambara: A duplication structure is found in the vocabulary of the African 
language Bambara, demonstrating a strong non-context-freeness, i.e., on the sur- 
face and in the deep structure: 

malonyininafilela o malonyininafilela o 

(one who searches for rice watchers + one who searches for rice watchers = 
whoever searches for rice watchers) 

This has the structure $\{wcw \mid w \in \{a,b\}^*\}$. The crossed agreement
structure $\{a^n b^m c^n d^m \mid m, n > 0\}$ can also be inferred.

Swiss German: The following example is a strong non-context-free structure, 
again showing crossed agreement: 

Jan sait das mer (d’chind)^m (em Hans)^n es huus haend wele (laa)^m (hälfe)^n
aastriiche

(Jan said that we wanted to let the children help Hans paint the house)

This has the structure $x w a^m b^n y c^m d^n z$, where $a$, $b$ stand for accusative and
dative noun phrases, respectively, and $c$, $d$ for the corresponding accusative and
dative verb phrases, respectively.

The Chomsky hierarchy does not provide a specific demarcation of language 
families having the desired properties [2]. The family of context-free languages
has good computational properties, but it does not contain some important
formal languages that appear in human languages. On the other hand, the family 
of context-sensitive languages contains all important constructions that occur in 
natural languages, but it is believed that the membership problem for languages 
in this family cannot be solved in deterministic polynomial time. 

The difficulty of working with context-sensitive grammars has obliged re- 
searchers to look for ways to generate non-context-free structures using context- 
free rules. This idea has led to Regulated Rewriting in Formal Language Theory 
[3] and to the so-called Mildly Context-Sensitive Devices in Linguistics [4]. 

Hence, the Mildly Context-Sensitive languages are the most appropriate to 
describe natural language because they: (a) include non-context-free construc- 
tions that were found in the syntax of natural languages; (b) are computationally 
feasible, i.e., the membership problem for them is solvable in deterministic poly- 
nomial time. 

In this paper, by a Mildly Context-Sensitive family of languages we mean a 
family C of languages that satisfies the following conditions [2] : 

(i) each language in C is semilinear;

(ii) for each language in C the membership problem is solvable in deterministic
polynomial time;

(iii) C contains the following three non-context-free languages:

- multiple agreements: $L_1 = \{a^n b^n c^n \mid n > 0\}$

- crossed agreements: $L_2 = \{a^n b^m c^n d^m \mid n, m > 0\}$

- duplication: $L_3 = \{ww \mid w \in \{a,b\}^*\}$
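
As an informal illustration of condition (ii) for the three languages in condition (iii), the following minimal sketch tests membership directly on strings, in time polynomial in the length of the input. The function names are hypothetical and are not taken from [2]; the code only mirrors the set definitions given above.

```python
import re


def in_multiple_agreements(w: str) -> bool:
    """Membership in L1 = { a^n b^n c^n | n > 0 }."""
    m = re.fullmatch(r"(a+)(b+)(c+)", w)
    return bool(m) and len(m.group(1)) == len(m.group(2)) == len(m.group(3))


def in_crossed_agreements(w: str) -> bool:
    """Membership in L2 = { a^n b^m c^n d^m | n, m > 0 }."""
    m = re.fullmatch(r"(a+)(b+)(c+)(d+)", w)
    return (bool(m)
            and len(m.group(1)) == len(m.group(3))
            and len(m.group(2)) == len(m.group(4)))


def in_duplication(w: str) -> bool:
    """Membership in L3 = { ww | w in {a, b}* } (the empty word is included)."""
    half, odd = divmod(len(w), 2)
    return odd == 0 and set(w) <= {"a", "b"} and w[:half] == w[half:]
```

For instance, `in_crossed_agreements("aabccd")` holds because the blocks of a's and c's have equal length, as do the blocks of b's and d's.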

There exist different mechanisms for defining mildly context-sensitive
families: tree adjoining grammars [4], head grammars [5], combinatory categorial
grammars [6], linear indexed grammars [7], simple matrix grammars [8], etc. We
will study another mechanism, which is a natural extension of Marcus External
Contextual grammars [2]. This mechanism is (technically) much simpler than
any other model found in the literature on Mildly Context-Sensitive families
of languages. It does not involve nonterminals, and it has no rules of
derivation except one general rule: to adjoin contexts. Although this
mechanism generates a proper subclass of simple matrix languages, it is still
Mildly Context-Sensitive.



3 Learnability Model
to Study Children’s Language Acquisition

Within Computational Learning Theory, there are three major established for- 
mal models for learning from examples or for grammatical inference [1]: 

Gold: identification in the limit model [9]. In the second half of the 1960s,
Gold was the first to formalize the process of learning formal languages.
Motivated by observing children’s language learning, he proposed that learning
is a process of making successive guesses of grammars that does not terminate in
a finite number of steps; it can only converge to a correct grammar in the limit.
Identification in the limit provides a learning model where an infinite sequence
of examples of the unknown grammar G is presented to the inference algorithm
M, and the eventual or limiting behavior of the algorithm is used as the criterion
of its success.
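
To make the limiting behavior concrete, here is a minimal sketch on a toy class that is not taken from the paper: the languages $L_n = \{a^k \mid 1 \le k \le n\}$. The hypothetical learner below conjectures the largest length seen so far; on any presentation of $L_n$ its conjectures stabilize on $n$ once the longest word has appeared, although the learner itself never knows that it has converged.

```python
from typing import Iterable, Iterator


def learner(text: Iterable[str]) -> Iterator[int]:
    """Yield a conjecture (the index n of L_n) after each positive example."""
    guess = 0
    for example in text:
        # The only evidence is positive: enlarge the guess when forced to.
        guess = max(guess, len(example))
        yield guess


# A finite prefix of a presentation of L_3 (toy data, chosen for illustration).
presentation = ["a", "aa", "a", "aaa", "aa", "a", "aaa"]
conjectures = list(learner(presentation))
```

On this prefix the conjectures change only while longer words keep arriving and remain fixed afterwards.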

Angluin: query learning model [10]. Angluin considered a learning paradigm 
in which the learner has access to an expert teacher. The teacher is a fixed set of 
oracles that can answer specific kinds of queries made by the learner (inference 
algorithm) on the unknown grammar G. Typical types of queries include the 
following: 

(i) Membership. The input is a string $w \in \Sigma^*$ and the output is “yes” if w is
generated by G and “no” otherwise.

(ii) Equivalence. The input is a grammar G’ and the output is “yes” if G’
generates the same language as G (G’ is equivalent to G) and “no” otherwise.
If the answer is “no”, a string w in the symmetric difference of the language
L(G) generated by G and the language L(G’) generated by G’ is returned.
This returned string w is called a counterexample.
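
The two query types can be sketched as follows. This is only an illustration under strong assumptions: languages are represented by Python predicates rather than grammars, and the equivalence query is approximated by comparing the two languages on all strings up to a fixed length, since exact equivalence cannot in general be decided for arbitrary grammars. The class name `BoundedTeacher` and its parameters are hypothetical.

```python
from itertools import product
from typing import Callable, Optional

Language = Callable[[str], bool]


class BoundedTeacher:
    """Answers membership queries exactly and equivalence queries up to max_len."""

    def __init__(self, target: Language, alphabet: str, max_len: int):
        self.target = target
        self.alphabet = alphabet
        self.max_len = max_len

    def membership(self, w: str) -> bool:
        return self.target(w)

    def equivalence(self, hypothesis: Language) -> Optional[str]:
        """Return None for "yes", otherwise a shortest counterexample found."""
        for n in range(self.max_len + 1):
            for letters in product(self.alphabet, repeat=n):
                w = "".join(letters)
                if self.target(w) != hypothesis(w):
                    return w  # w lies in the symmetric difference
        return None


def anbn(w: str) -> bool:
    n = len(w) // 2
    return len(w) % 2 == 0 and w == "a" * n + "b" * n


def equal_counts(w: str) -> bool:
    return set(w) <= {"a", "b"} and w.count("a") == w.count("b")


teacher = BoundedTeacher(anbn, "ab", max_len=6)
counterexample = teacher.equivalence(equal_counts)
```

Here the hypothesis accepts every word with as many a's as b's, so the teacher returns a word on which it disagrees with the target.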

Valiant: PAC learning model [11]. Valiant introduced probably approximately 
correct learning (PAC learning, in short), which is a distribution-independent 
probabilistic model of learning from random examples. In this model, the infer- 
ence algorithm takes a sample as input and produces a grammar as output. A 
successful inference algorithm is one that with high probability (at least $1 - \delta$)
finds a grammar whose error is small (less than $\epsilon$).

Which model is the most adequate to study children’s language acquisition?
We will distinguish two stages in the process of language acquisition: 

1st stage: learning from positive data. It is widely believed that children 
do not receive negative examples (examples of sentences that are not in the 
language). They get only positive data from the surroundings, namely linguistic 
constructions that are grammatically correct. 

2nd stage: learning from correction queries. Children need to commu- 
nicate more complex ideas, and therefore must increase the complexity of their 
constructions beyond the level acquired in the first stage. Hence, in this stage, 
receiving only positive data is not enough: they need to ask questions about the 
grammar of their language. 

Therefore, we propose here a novel learning model inspired by Gold’s model 
and Angluin’s model: learnability in the limit from positive data and correction 
queries. 

Correction queries are an extension of membership queries. The output of a
membership query is “yes” if the string is generated by an unknown grammar G
and “no” otherwise. In the case of correction queries, if the answer is “no”, then
a corrected string is returned. However, in this paper, we will study
learnability in the limit from positive data only. See the conclusion for a
discussion of corrected strings.



4 Basic Definitions

The following basic notations and Definitions 1-4 are taken directly from [2].

For an alphabet $\Sigma$, let $\Sigma^*$ be the free monoid generated by $\Sigma$ with the
identity $\lambda$. The free semigroup generated by $\Sigma$ is $\Sigma^+ = \Sigma^* - \{\lambda\}$. Elements in
$\Sigma^*$ ($\Sigma^+$) are referred to as words (nonempty words); $\lambda$ is the empty word. Assume
that $a \in \Sigma$ and $w \in \Sigma^*$; the length of $w$ is denoted by $|w|$, while the number of
occurrences of $a$ in $w$ is denoted by $|w|_a$. A context is a pair of words, i.e., $(u, v)$,
where $u, v \in \Sigma^*$. $\mathbb{N}$ denotes the set of natural numbers.

A minimal linear language is a language generated by a minimal linear gram- 
mar, that is, a linear grammar with just one nonterminal. 

The families of regular, minimal linear, linear, context-free, and
context-sensitive languages are denoted by REG, MinLIN, LIN, CF, and CS,
respectively.

Assume that $\Sigma = \{a_1, a_2, \ldots, a_k\}$. The Parikh mapping, denoted by $\psi$, is:

$\psi : \Sigma^* \to \mathbb{N}^k$, $\psi(w) = (|w|_{a_1}, |w|_{a_2}, \ldots, |w|_{a_k})$.

If $L$ is a language, then the Parikh set of $L$ is defined by:
$\psi(L) = \{\psi(w) \mid w \in L\}$.

A linear set is a set $M \subseteq \mathbb{N}^k$ such that $M = \{v_0 + \sum_{i=1}^{m} v_i x_i \mid x_i \in \mathbb{N}\}$, for
some $v_0, v_1, \ldots, v_m$ in $\mathbb{N}^k$. A semilinear set is a finite union of linear sets, and a
semilinear language is a language $L$ such that $\psi(L)$ is a semilinear set.
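
A small sketch of the Parikh mapping may help. The function name `parikh` and the representation of the alphabet as an ordered string are assumptions made for illustration; letters outside the given alphabet are simply ignored.

```python
from collections import Counter
from typing import Iterable, Set, Tuple


def parikh(w: str, alphabet: str) -> Tuple[int, ...]:
    """Parikh vector of w: the number of occurrences of each letter, in alphabet order."""
    counts = Counter(w)
    return tuple(counts[a] for a in alphabet)


def parikh_set(words: Iterable[str], alphabet: str) -> Set[Tuple[int, ...]]:
    """Parikh set of a finite collection of words."""
    return {parikh(w, alphabet) for w in words}


# The first three words of { a^n b^n c^n | n > 0 }.
sample = ["abc", "aabbcc", "aaabbbccc"]
vectors = parikh_set(sample, "abc")
```

The vectors obtained for this sample are (1,1,1), (2,2,2) and (3,3,3), which all lie in the linear set with $v_0 = v_1 = (1,1,1)$, in line with the semilinearity of this language.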

The reader can find supplementary information regarding the basic notions 
on formal languages we use in this paper in [12]. 



4.1 Marcus External Contextual Grammars 

Contextual grammars were first considered by Marcus [13] to model some natural
aspects of descriptive linguistics, for instance, the acceptance of a word
(construction) only in certain contexts. There are many variants of Marcus
Contextual grammars, but all of them are based on context adjoining. The differences
are in the way of adjoining contexts, the sites where contexts are adjoined, the
use of selectors, etc. For a detailed introduction to the topic, the reader is referred
to the monograph [14].

Definition 1 A Marcus External Contextual grammar is $G = (\Sigma, B, C)$, where
$\Sigma$ is the alphabet of G, B is a finite subset of $\Sigma^*$ called the base of G, and C is
a finite set of contexts, i.e. a finite set of pairs of words over $\Sigma$. C is called the
set of contexts of G.

The direct derivation relation with respect to G is a binary relation between
words over $\Sigma$, denoted $\Rightarrow_G$, or $\Rightarrow$ if G is understood from the context. By
definition, $x \Rightarrow_G y$, where $x, y \in \Sigma^*$, iff $y = uxv$ for some $(u, v) \in C$. The derivation
relation with respect to G, denoted $\Rightarrow_G^*$, or $\Rightarrow^*$ if G is understood from the
context, is the reflexive and transitive closure of $\Rightarrow_G$.

Definition 2 Let $G = (\Sigma, B, C)$ be a Marcus External Contextual grammar.
The language generated by G, denoted by L(G), is defined as:

$L(G) = \{y \in \Sigma^* \mid \text{there exists } x \in B \text{ such that } x \Rightarrow_G^* y\}$.

One can verify that the language generated by $G = (\Sigma, B, C)$ is the smallest
language L over $\Sigma$ such that:

(i) $B \subseteq L$

(ii) if $x \in L$ and $(u, v) \in C$, then $uxv \in L$

The family of all Marcus External Contextual languages is denoted by EC.

Remark 1. EC = MinLIN, which is a strict subfamily of LIN, incomparable 
with REG (see [14]). 
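
The closure characterization of L(G) suggests a simple membership test for the one-dimensional case: a word belongs to L(G) if it is in the base, or if stripping some context (u, v) from its two ends leaves a word that belongs to L(G). The sketch below follows this idea; the function name `ec_member` and the data representation (Python strings, with the empty string standing for the empty word) are assumptions made for illustration.

```python
from functools import lru_cache
from typing import Iterable, Tuple


def ec_member(word: str, base: Iterable[str], contexts: Iterable[Tuple[str, str]]) -> bool:
    """Decide whether `word` is generated by the external contextual grammar (base, contexts)."""
    base_set = frozenset(base)
    # The trivial context (empty, empty) derives nothing new, so it is dropped;
    # every remaining context strictly shortens the word when it is stripped.
    useful = [(u, v) for (u, v) in contexts if u or v]

    @lru_cache(maxsize=None)
    def derivable(y: str) -> bool:
        if y in base_set:
            return True
        for u, v in useful:
            fits = len(u) + len(v) <= len(y)
            if fits and y.startswith(u) and y.endswith(v):
                if derivable(y[len(u):len(y) - len(v)]):
                    return True
        return False

    return derivable(word)
```

For example, with base {λ} and the single context (a, b) the grammar generates the words a^n b^n, and with base {c} and contexts (a, a), (b, b) it generates the palindromes over {a, b} with center c.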

4.2 Marcus Many-Dimensional External Contextual Grammars 

Marcus many-dimensional External Contextual grammars are an extension of 
Marcus External Contextual grammars, but they work with vectors of words 
and vectors of contexts [2] . 

Let $p \ge 1$ be a fixed integer, and let $\Sigma$ be an alphabet. A p-word x over
$\Sigma$ is a p-dimensional vector whose components are words over $\Sigma$, i.e.,
$x = (x_1, x_2, \ldots, x_p)$, where $x_i \in \Sigma^*$, $1 \le i \le p$. A p-context c over $\Sigma$ is a p-dimensional
vector whose components are contexts over $\Sigma$, i.e., $c = [c_1, c_2, \ldots, c_p]$ where
$c_i = (u_i, v_i)$, $u_i, v_i \in \Sigma^*$, $1 \le i \le p$. We denote vectors of words with round brackets,
and vectors of contexts with square brackets.

Definition 3 Let $p \ge 1$ be an integer. A Marcus p-dimensional External
Contextual grammar is $G = (\Sigma, B, C)$, where $\Sigma$ is the alphabet of G, B is a finite
set of p-words over $\Sigma$ called the base of G, and C is a finite set of p-contexts
over $\Sigma$. C is called the set of contexts of G.

The direct derivation relation with respect to G is a binary relation
between p-words over $\Sigma$, denoted by $\Rightarrow_G$, or $\Rightarrow$ if G is understood from the
context. Let $x = (x_1, x_2, \ldots, x_p)$ and $y = (y_1, y_2, \ldots, y_p)$ be two p-words over $\Sigma$.
By definition, $x \Rightarrow_G y$ iff $y = (u_1 x_1 v_1, u_2 x_2 v_2, \ldots, u_p x_p v_p)$ for some p-context
$c = [(u_1, v_1), (u_2, v_2), \ldots, (u_p, v_p)] \in C$. The derivation relation with respect to G,
denoted by $\Rightarrow_G^*$, or $\Rightarrow^*$ if no confusion is possible, is the reflexive and transitive
closure of $\Rightarrow_G$.

Definition 4 Let $G = (\Sigma, B, C)$ be a Marcus p-dimensional External
Contextual grammar. The language generated by G, denoted L(G), is defined as:

$L(G) = \{y \in \Sigma^* \mid \text{there exists } (x_1, x_2, \ldots, x_p) \in B \text{ such that } (x_1, x_2, \ldots, x_p) \Rightarrow_G^* (y_1, y_2, \ldots, y_p) \text{ and } y = y_1 y_2 \cdots y_p\}$.

The family of all Marcus p-dimensional External Contextual languages is
denoted by $\mathcal{EC}_p$.

The trivial (empty) context $[(\lambda, \lambda), (\lambda, \lambda), \ldots, (\lambda, \lambda)]$ is not necessary; therefore,
we will not consider it in the remainder of this paper.

Remark 2. Note that $EC = \mathcal{EC}_1$ and $\mathcal{EC}_p \subset \mathcal{EC}_q$, for all $1 \le p < q$ (see [2]).

Remark 3. Any family $\mathcal{EC}_p$ for $p \ge 2$ is a subfamily of linear simple matrix
languages (see [2]).
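
The direct derivation step of Definition 3 and the concatenation step of Definition 4 can be sketched in a few lines. The helper names `adjoin` and `yield_word` are hypothetical; p-words are represented as tuples of strings and p-contexts as lists of pairs, with the empty string standing for the empty word.

```python
from typing import List, Tuple

PWord = Tuple[str, ...]
PContext = List[Tuple[str, str]]


def adjoin(pword: PWord, pcontext: PContext) -> PWord:
    """One direct derivation step: each component x_i becomes u_i x_i v_i."""
    if len(pword) != len(pcontext):
        raise ValueError("p-word and p-context must have the same dimension")
    return tuple(u + x + v for x, (u, v) in zip(pword, pcontext))


def yield_word(pword: PWord) -> str:
    """The word contributed to L(G): the concatenation of the components."""
    return "".join(pword)


# A two-dimensional example: one base 2-word and two 2-contexts.
x = ("ab", "cd")
c1 = [("a", ""), ("c", "")]
c2 = [("", "b"), ("", "d")]
y = adjoin(adjoin(x, c1), c2)
```

Applying c1 and then c2 to the base 2-word yields the 2-word (aabb, ccdd), whose concatenation is aabbccdd: the two components grow in parallel, which is what produces the crossed dependency.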



5 The Simple p-Dimensional External Contextual Case 

It is clear that not all regular languages over the alphabet of, say, English are
subsets of it. Indeed, some authors have pointed this out in the past in a more or
less informal way (see [15]). Hence, natural languages could occupy an eccentric
position in the Chomsky hierarchy. We therefore need a new hierarchy, which
should certainly hold strong relationships with the Chomsky hierarchy, but which
should not coincide with it. In a certain sense, the new hierarchy should be
incomparable with the Chomsky hierarchy and pass across it [2].

Since the families of languages generated by many-dimensional External Con- 
textual grammars have the property of transversality, they appear to be appro- 
priate candidates to model natural language syntax and, therefore, the most 
adequate towards our goal of understanding the process of children’s language 
acquisition. 

In this paper, we study the learnability of this family of languages from
positive data only. We thereby address the first stage of the language
acquisition process.

It is desirable that learning can be achieved using only positive data, but it 
is generally impossible to learn a certain family of languages in the limit from 
positive data. Gold proves in [9] that, as soon as a class of languages contains 
all of the finite languages and at least one infinite language (called a superfi- 
nite class), it is not identifiable in the limit from positive data. Hence, regular, 
context-free and context-sensitive languages are not learnable from positive data 
in Gold’s model. They can only be identified in the limit when the learner has
access to both positive and negative evidence. 

How do children overcome Gold’s theoretical hurdle? The answer could rely
on the supposition that children do not need to learn a superfinite class of
languages; learning from only positive data is then possible.

According to the general definition, the $\mathcal{EC}_p$ grammar family is superfinite,
since the base of G can be any finite set of p-words. Hence, we need to impose some
restrictions to make it possible to learn this class in the limit from only positive
data.

Definition 5 A Simple Marcus p-dimensional External Contextual grammar is
$G = (\Sigma, B, C)$, where $\Sigma$ is the alphabet of G, B is a singleton set consisting of
one p-word over $\Sigma$, called the base of G, and C is a finite set of p-contexts over
$\Sigma$. C is called the set of contexts of G.

The family of all Simple Marcus p-dimensional External Contextual
languages is denoted by $\mathcal{SEC}_p$.

Fig. 1. The $\mathcal{SEC}_p$ family occupies an eccentric position in the Chomsky hierarchy.



Note 1. In what follows, we use the notation $\mathcal{SEC}_p$ to refer to families of
both languages and grammars, as long as no confusion arises from the context.

Since the base of G is a singleton, this class is not superfinite (as seen later).
Moreover, even if B is a singleton, it is enough to describe the three
non-context-free languages listed in condition (iii) of Section 2 (a necessary condition
for a family C of languages to be a Mildly Context-Sensitive family of languages).
It is easy to construct $\mathcal{SEC}_p$ grammars for each of these languages, but we omit
the proof due to lack of space.
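
As an informal check of this claim (not a proof), the sketch below proposes one candidate grammar with a singleton base for each of the three languages and compares the words it generates with the set definition, for all words up to a fixed length. The grammars G1 and G3 are our own illustrative constructions and are not taken from the source; G2 uses a two-dimensional base (ab, cd) with contexts that extend the two components in parallel. All function names are hypothetical.

```python
import re
from itertools import product


def sec_language(base, contexts, max_len):
    """Words of L(G) of length <= max_len, for a grammar with the singleton base {base}."""
    seen = {tuple(base)}
    frontier = [tuple(base)]
    while frontier:
        new_frontier = []
        for pword in frontier:
            for ctx in contexts:
                # Adjoin the p-context componentwise: x_i becomes u_i x_i v_i.
                nxt = tuple(u + x + v for x, (u, v) in zip(pword, ctx))
                if sum(map(len, nxt)) <= max_len and nxt not in seen:
                    seen.add(nxt)
                    new_frontier.append(nxt)
        frontier = new_frontier
    return {"".join(pw) for pw in seen if sum(map(len, pw)) <= max_len}


def reference(alphabet, accept, max_len):
    """Brute-force enumeration of the accepted words up to max_len."""
    return {"".join(t) for n in range(max_len + 1)
            for t in product(alphabet, repeat=n) if accept("".join(t))}


def blocks(pattern, equal):
    """Recognizer: match `pattern`, then require equal lengths for each pair of groups."""
    def accept(w):
        m = re.fullmatch(pattern, w)
        return bool(m) and all(len(m.group(i)) == len(m.group(j)) for i, j in equal)
    return accept


# Candidate grammars (base p-word, list of p-contexts); "" is the empty word.
G1 = (("a", "b", "c"), [[("a", ""), ("b", ""), ("c", "")]])            # a^n b^n c^n
G2 = (("ab", "cd"), [[("a", ""), ("c", "")], [("", "b"), ("", "d")]])   # crossed agreements
G3 = (("", ""), [[("", "a"), ("", "a")], [("", "b"), ("", "b")]])       # ww

N = 8
ok_L1 = sec_language(*G1, N) == reference("abc", blocks(r"(a+)(b+)(c+)", [(1, 2), (2, 3)]), N)
ok_L2 = sec_language(*G2, N) == reference("abcd", blocks(r"(a+)(b+)(c+)(d+)", [(1, 3), (2, 4)]), N)
ok_L3 = sec_language(*G3, N) == reference(
    "ab", lambda w: len(w) % 2 == 0 and w[:len(w) // 2] == w[len(w) // 2:], N)
```

The comparison is bounded by N, so it only supports the claim on short words; it says nothing about longer ones.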

The properties of this class show that it is adequate for our goal of studying
language acquisition. Like an $\mathcal{EC}_p$ grammar, a $\mathcal{SEC}_p$
grammar has the following properties:

(a) Any family $\mathcal{SEC}_p$ for $p \ge 2$ is a subfamily of linear simple matrix languages.

(b) For every integer $p \ge 2$, the family $\mathcal{SEC}_p$ is a mildly context-sensitive family
of languages.

(c) If $1 \le p < q$, then $\mathcal{SEC}_p \subset \mathcal{SEC}_q$ (i.e., the inclusion is proper).

(d) The families $(\mathcal{SEC}_p)_{p \ge 2}$ define an infinite hierarchy of mildly
context-sensitive languages.

(e) $\mathcal{SEC}_p$ is strictly contained in the family CS and is incomparable with the
families CF and REG.

All these properties follow directly from [2].

Moreover, $\mathcal{SEC}_p$ is properly included in $\mathcal{EC}_p$:

$\mathcal{SEC}_p \subset \mathcal{EC}_p$

For example, the language L = {a, b, c} is generated by an $\mathcal{EC}_p$ grammar, but it
cannot be generated by any $\mathcal{SEC}_p$ grammar: with a singleton base, a grammar
without contexts generates exactly one word, and a grammar with at least one
nontrivial context generates an infinite language. This demonstrates that
$\mathcal{SEC}_p$ is not superfinite.

Figure 1 shows the location of the $\mathcal{SEC}_p$ family in the Chomsky hierarchy.



5.1 Learnability of $\mathcal{SEC}_p$ Languages from Only Positive Data

Now, we turn to the question of whether the class of languages generated by
$\mathcal{SEC}_p$ grammars is learnable from positive data.

From a result by Shinohara [16], the class of languages generated by CS
grammars with a fixed number of rules is learnable from only positive data.
Hence, if we can transform a given $\mathcal{SEC}$ grammar with dimension p and degree
q into an equivalent LSMG (linear simple matrix grammar [3]) with dimension
p’ and degree q’, and this into an equivalent CS grammar with a fixed number
of rules, we will achieve our goal.

First, we need to define p, q, p’ and q’.

(i) $\mathcal{SEC}_{p,q}$:

p: dimension (in the same sense as $\mathcal{SEC}_p$),
q: degree (the number of contexts).

(ii) $\mathcal{LSMG}_{p',q'}$:

p’: number of nonterminals on the right-hand side of the unique rule of the
LSMG that rewrites S.
q’: number of matrices.

We will give the following constructive demonstration to prove that
$\mathcal{SEC}_{p,q} \subseteq \mathcal{LSMG}_{p',q'} \subseteq$ CS grammars with a fixed number of rules.

Let $G = (\Sigma, B, C)$ be a $\mathcal{SEC}_{p,q}$ grammar, where

- $B = \{(\gamma_1, \ldots, \gamma_p)\}$

- $C = \{\, c_1 = [(\alpha_1^1, \beta_1^1), \ldots, (\alpha_p^1, \beta_p^1)], \ldots, c_q = [(\alpha_1^q, \beta_1^q), \ldots, (\alpha_p^q, \beta_p^q)] \,\}$

We can transform this $\mathcal{SEC}$ grammar with dimension p and degree q into an
equivalent LSMG with dimension p’ and degree q’.

$G' = (N_1, \ldots, N_p, \Sigma, P, S)$, where

- $P = \{\, S \to A_1 \cdots A_p,$
  $(A_1 \to \gamma_1, \ldots, A_p \to \gamma_p),$
  $(A_1 \to \alpha_1^1 A_1 \beta_1^1, \ldots, A_p \to \alpha_p^1 A_p \beta_p^1),$
  $(\cdots),$
  $(A_1 \to \alpha_1^q A_1 \beta_1^q, \ldots, A_p \to \alpha_p^q A_p \beta_p^q) \,\}$

for $A_i \in N_i$, $\alpha_i^j, \beta_i^j \in \Sigma^*$, $1 \le i \le p$.

The number of rules of an equivalent CSG will be proportional to $p' \cdot q'$.
Generally, there exists a CSG with the number of rules $\le k \cdot p' \cdot q'$ (k is a
constant).

We now illustrate this method with a simple example. Consider a $\mathcal{SEC}_{p,q}$
grammar with p = 2 and q = 2.

Let $G = (\{a, b, c, d\}, B, C)$ be a $\mathcal{SEC}_{p,q}$ grammar, where

- $B = \{(ab, cd)\}$

- $C = \{\, c_1 = [(a, \lambda), (c, \lambda)],\ c_2 = [(\lambda, b), (\lambda, d)] \,\}$

Note that $L(G) = \{a^m b^n c^m d^n \mid m, n > 0\}$.

We can transform this $\mathcal{SEC}$ grammar with dimension p and degree q into an
equivalent LSMG with dimension p’ and degree q’.

$G' = (\{S, A, A'\}, \{a, b, c, d\}, P, S)$, where

- $P = \{\, m_0 : S \to AA',$
  $m_1 : (A \to ab,\ A' \to cd),$
  $m_2 : (A \to aA,\ A' \to cA'),$
  $m_3 : (A \to Ab,\ A' \to A'd) \,\}$.
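
The first step of the construction, from a simple external contextual grammar to the matrices of a linear simple matrix grammar, is mechanical and can be sketched as follows. The function name `sec_to_lsmg`, the nonterminal names A1, ..., Ap and the encoding of right-hand sides as tuples of symbols are assumptions made for illustration.

```python
def sec_to_lsmg(base, contexts):
    """Build the start rule and the matrices of an LSMG from a grammar (base, contexts)."""
    p = len(base)
    nts = [f"A{i}" for i in range(1, p + 1)]   # hypothetical nonterminal names
    start_rule = ("S", tuple(nts))              # S -> A1 ... Ap
    # Terminating matrix: (A1 -> gamma_1, ..., Ap -> gamma_p).
    matrices = [[(A, tuple(gamma)) for A, gamma in zip(nts, base)]]
    # One matrix per context: (A1 -> u_1 A1 v_1, ..., Ap -> u_p Ap v_p).
    for ctx in contexts:
        matrices.append([(A, tuple(u) + (A,) + tuple(v)) for A, (u, v) in zip(nts, ctx)])
    return start_rule, matrices


# The example grammar: base (ab, cd), contexts c1 and c2 ("" is the empty word).
start_rule, matrices = sec_to_lsmg(
    ("ab", "cd"),
    [[("a", ""), ("c", "")], [("", "b"), ("", "d")]],
)
```

For the example, the sketch produces a start rule with p = 2 nonterminals and q + 1 = 3 matrices, which correspond to the rules m0 to m3 up to the renaming of A and A' as A1 and A2.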

Now, we can construct a CSG: $G'' = (V_N, T, P', S)$, where
$V_N = \{S, A, A', M, R_1, R_2, R_3\}$ and $P'$ consists of the following rules:

$S \to AMA'$

$AM \to abR_1$,   $R_1 b \to bR_1$,   $R_1 c \to cR_1$,   $bM \to Mb$
$AM \to aAR_2$,   $R_2 b \to bR_2$,   $R_2 c \to cR_2$,   $cM \to Mc$
$AM \to AbR_3$,   $R_3 b \to bR_3$,   $R_3 c \to cR_3$

$R_1 A' \to Mcd$,   $R_1 A' \to cd$
$R_2 A' \to McA'$,   $R_2 A' \to cA'$
$R_3 A' \to MA'd$,   $R_3 A' \to A'd$


Note that the set of rules presented here may contain some redundancy.
However, we gave priority to a uniform construction of the
corresponding CSGs that also covers the general case.

It is easy to prove that L(G) = L(G') = L(G''), but we omit the proof due
to lack of space.

Hence, there are clear relationships between $\mathcal{SEC}_{p,q}$, $\mathcal{LSMG}_{p',q'}$ and CSG.

(i) p' = p (in our example, p is equal to 2; therefore, the number of
nonterminals on the right-hand side of the unique rule of the LSMG that
rewrites S is 2).

(ii) q' = q + 1 (in our example, q is equal to 2; therefore, the number of matrices
of the LSMG has to be 3).

(iii) The number of rules of the CSG is proportional to $p' \cdot q'$. Generally, one
can have G'' with $O(p' \cdot q')$ rules. Since p' and q' are given, G''
has a bounded number of rules.

From a result by Shinohara [16], we can obtain the following theorem:

Theorem 1. Given p' > 0 and q' > 0, the class of languages generated by 
linear simple matrix grammars with dimension p' and degree q' is learnable from 
positive data. 

Corollary 1. Given p > 0 and q > 0, the class of languages generated by simple
external contextual grammars with dimension p and degree q is learnable from
positive data.



6 Conclusions

The grammar introduced in this paper, the $\mathcal{SEC}_p$ grammar, is an appropriate
candidate to attain our goal: to understand the process of children’s language
acquisition.

We have briefly reported on research attempting to apply formal grammars
to the study of natural language. We have seen that the most appropriate
formal grammars to describe natural language are the Mildly Context-Sensitive
grammars. One of the simplest mechanisms for defining Mildly Context-Sensitive
families is a natural extension of Marcus External Contextual grammars: the
many-dimensional Marcus External Contextual grammars.

We have also discussed the most adequate model to study children’s
language acquisition. We have distinguished two stages in the acquisition of
language: (1) learning from only positive data; (2) learning from correction queries.
Accordingly, we have proposed a novel learning model inspired by Gold’s model
and Angluin’s model: learnability in the limit from positive data and correction
queries.

In this paper, we have studied the learnability from only positive data. To 
begin with, by making a certain restriction on the grammar, we have considered 
a subfamily of the Marcus p-dimensional External Contextual grammars, called 
Simple Marcus p-dimensional External Contextual grammars. 

Finally, we have shown, using Shinohara’s results [16], that the class of
languages generated by simple external contextual grammars with fixed dimension
and degree is learnable from positive data. The learning algorithm derived
directly from our main result is enumerative in nature and therefore not
time-efficient, but we have obtained a positive learnability result as the first step
toward our final goal.

Of special interest in the future will be: 

• To improve this algorithm and to obtain a time-efficient algorithm in prac- 
tical cases. 

• To study the learnability from correction queries and to explain the
acquisition of language in the second stage. In our future research, we
intend to define a correction query as an oracle that takes a string w as
input and produces as output a corrected string $w_c$ (if w is close to an
element $w_c$ of the target L) and “No” (otherwise), where w being close to $w_c$
is defined by a certain measure (such as “one-letter” difference or Hamming
distance).

• To connect the concept of approximate learning with linguistic
motivations (extending our results). This kind of learning could be appropriate for
our purpose.
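
The correction-query oracle outlined in the second item above can be sketched as follows. This is only one possible reading, under assumptions that the text leaves open: the target language is represented by a finite sample of its words, closeness is measured by Hamming distance with a hypothetical threshold `max_dist`, and a query for a word that already belongs to the sample is answered “Yes”.

```python
from typing import Iterable, Optional


def hamming(u: str, v: str) -> Optional[int]:
    """Hamming distance, or None when the words have different lengths."""
    if len(u) != len(v):
        return None
    return sum(a != b for a, b in zip(u, v))


def correction_query(w: str, target_sample: Iterable[str], max_dist: int = 1) -> str:
    """Answer "Yes", a closest corrected string within max_dist, or "No"."""
    sample = set(target_sample)
    if w in sample:
        return "Yes"
    best = None
    for candidate in sorted(sample):  # sorted only to make the answer deterministic
        d = hamming(w, candidate)
        if d is not None and d <= max_dist and (best is None or d < best[0]):
            best = (d, candidate)
    return best[1] if best else "No"


sample = {"ab", "aabb", "aaabbb"}
```

With this sample, the query for abbb is corrected to aabb (one letter differs), while the query for bbbb is rejected because no word of the sample is within distance one.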

In this way, we will see that the $\mathcal{SEC}_p$ grammar is truly adequate to
understand children’s language acquisition.

Abstract. This paper investigates the learnability of Pregroup Gram-
mars, a context-free grammar formalism recently defined in the field
of computational linguistics. In a first theoretical approach, we provide 
learnability and non-learnability results in the sense of Gold for sub- 
classes of Pregroup Grammars. In a second more practical approach, 
we propose an acquisition algorithm from a special kind of input called 
Feature-tagged Examples, that is based on sets of constraints. 

Keywords: Learning from positive examples, Pregroup grammars, Com- 
putational linguistics, Categorial Grammars, Context-Free grammars. 



1 Introduction 

Pregroup Grammars [1] (PGs in short) is a context-free grammar formalism 
used in the field of computational linguistics. This recently-defined formalism for 
syntax allies expressivity (in this respect it is close to Lambek Grammars) and 
computational efficiency. Subtle linguistic phenomena have already been treated 
in this framework [2,3]. PGs share many features with Categorial Grammars of 
which they are inheritors, especially their lexicalized nature. 

Since the seminal work of Kanazawa [4], many learnability results in Gold’s
model [5] have been obtained for various classes of Categorial Grammars and
various input data. But the learnability of PGs has so far received very little
attention, apart from a negative result in [6]. In the first part of this paper, we prove
several learnability and non-learnability results for classes of PGs. These
results are mainly theoretical and are not associated with learning algorithms.

In the second part of the paper, we define an acquisition algorithm to specify
a set of PGs compatible with input data. The input data considered, called
Feature-tagged Examples, are richer than strings but chosen to be
language-independent (inspired by [7-9]). The originality of the process is that it allows
the learning problem to be recast as a constraint resolution problem.

* This research was partially supported by: “CPER 2000-2006, Contrat de Plan etat - 
region Nord/Pas-de-Calais: axe TACT, projet TIC”; fonds europeens FEDER “TIC 
- Fouille Intelligente de donnees - Traitement Intelligent des Connaissances” OBJ 
2-phasing out - 2001/3 - 4.1 - n 3. And by “ACI masse de donnees ACIMDD”.

2 Pregroup Grammars

2.1 Background

Definition 1 (Pregroup). A pregroup is a structure $(P, \le, \cdot, l, r, 1)$ such that
$(P, \le, \cdot, 1)$ is a partially ordered monoid$^1$ and $l$, $r$ are two unary operations on P
that satisfy: $\forall a \in P$: $a^l a \le 1 \le a a^l$ and $a a^r \le 1 \le a^r a$. The following equations
follow from this definition: $\forall a, b \in P$, we have $a^{rl} = a = a^{lr}$, $1^r = 1 = 1^l$,
$(a \cdot b)^r = b^r \cdot a^r$, $(a \cdot b)^l = b^l \cdot a^l$. Iterated adjoints$^2$ are defined for $i \in \mathbb{Z}$:
$a^{(0)} = a$, for $i \le 0$: $a^{(i-1)} = (a^{(i)})^l$, for $i \ge 0$: $a^{(i+1)} = (a^{(i)})^r$.

Definition 2 (Free Pregroup). Let $(P, \le)$ be a partially ordered set of
primitive categories, $P^{(\mathbb{Z})} = \{p^{(i)} \mid p \in P, i \in \mathbb{Z}\}$ is the set of atomic categories and
$Cat_{(P,\le)} = (P^{(\mathbb{Z})})^* = \{p_1^{(i_1)} \cdots p_n^{(i_n)} \mid 1 \le k \le n,\ p_k \in P,\ i_k \in \mathbb{Z}\}$ is the set
of categories. For $X, Y \in Cat_{(P,\le)}$, $X \le Y$ iff this relation is deducible in the
system in Fig. 1 where $p, q \in P$, $n, k \in \mathbb{Z}$ and $X, Y, Z \in Cat_{(P,\le)}$. This
construction, proposed by Buszkowski, defines a pregroup that extends $\le$ on P to $Cat_{(P,\le)}$.

$X \le X$ (Id)      $\dfrac{X \le Y \quad Y \le Z}{X \le Z}$ (Cut)

$\dfrac{XY \le Z}{X p^{(n)} p^{(n+1)} Y \le Z}$ ($A_L$)      $\dfrac{X \le YZ}{X \le Y p^{(n+1)} p^{(n)} Z}$ ($A_R$)

$\dfrac{X p^{(k)} Y \le Z}{X q^{(k)} Y \le Z}$ ($IND_L$)      $\dfrac{X \le Y q^{(k)} Z}{X \le Y p^{(k)} Z}$ ($IND_R$)

$q \le p$ if k is even or $p \le q$ if k is odd
Fig. 1. System for Pregroup Grammars



Cut Elimination. Every derivable inequality has a cut-free derivation. 

Simple Free Pregroup. A simple free pregroup is a free pregroup where the 
order on primitive categories is equality. 

Definition 3 (Pregroup Grammars). Let $(P, \le)$ be a finite partially ordered set.
A free pregroup grammar based on $(P, \le)$ is a lexicalized$^3$ grammar $G = (\Sigma, I, s)$
such that $s \in P$; G assigns a category X to a string $v_1 \cdots v_n$ of $\Sigma^*$ iff for
$1 \le i \le n$, $\exists X_i \in I(v_i)$ such that $X_1 \cdots X_n \le X$ in the free pregroup based on
$(P, \le)$. The language $\mathcal{L}(G)$ is the set of strings in $\Sigma^*$ that are assigned s by G.



Rigid and k - Valued Grammars. Grammars that assign at most k categories 
to each symbol in the alphabet are called k-valued grammars; 1-valued grammars 
are also called rigid grammars. 

Width. We define the width of a category $C = p_1^{(u_1)} \cdots p_n^{(u_n)}$ as $wd(C) = n$ (the
number of atomic categories).

1 A monoid is a structure $\langle M, \cdot, 1 \rangle$, such that $\cdot$ is associative and has a neutral
element 1 ($\forall x \in M$: $1 \cdot x = x \cdot 1 = x$). A partially ordered monoid is a monoid
$(M, \cdot, 1)$ with a partial order $\le$ that satisfies $\forall a, b, c$: $a \le b \Rightarrow c \cdot a \le c \cdot b$ and $a \cdot c \le b \cdot c$.

2 We use this notation in technical parts.

3 A lexicalized grammar is a triple $(\Sigma, I, s)$: $\Sigma$ is a finite alphabet, I assigns a finite
set of categories to each $c \in \Sigma$, s is a category associated to correct sentences.

Example 1. Our first example is taken from [10] with the basic categories: $\pi_2$ =
second person, $s_1$ = statement in present tense, $p_1$ = present participle, $p_2$ = past
participle, $o$ = object. The sentence “You have been seeing her” gets category $s_1$
($s_1 \le s$), with successive reductions on $\pi_2 \pi_2^r \le 1$, $p_2^l p_2 \le 1$, $p_1^l p_1 \le 1$, $o^l o \le 1$:

You       have                    been            seeing        her
$\pi_2$   $(\pi_2^r s_1 p_2^l)$   $(p_2 p_1^l)$   $(p_1 o^l)$   $o$
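
The reductions of Example 1 can be replayed with a small sketch. It makes simplifying assumptions that should be kept in mind: atoms are encoded as pairs (primitive, exponent) with exponent -1 for a left adjoint, 0 for a plain primitive and +1 for a right adjoint; contractions are only applied between identical primitives (as in a simple free pregroup), and the order on primitives is used only for the final comparison with s. The search is exhaustive with memoization and is not the polynomial parsing algorithm of the paper. All names are hypothetical.

```python
from functools import lru_cache

L, R = -1, 1  # exponents of the left and right adjoints


def reduces_to_primitive(atoms):
    """Primitives q such that the atom sequence contracts to the single atom q^(0)."""

    @lru_cache(maxsize=None)
    def go(seq):
        results = set()
        if len(seq) == 1 and seq[0][1] == 0:
            results.add(seq[0][0])
        for i in range(len(seq) - 1):
            (p, n), (q, m) = seq[i], seq[i + 1]
            if p == q and m == n + 1:          # contraction p^(n) p^(n+1) <= 1
                results |= go(seq[:i] + seq[i + 2:])
        return frozenset(results)

    return go(tuple(atoms))


# Lexicon of Example 1 (one category per word).
lexicon = {
    "You": [("pi2", 0)],
    "have": [("pi2", R), ("s1", 0), ("p2", L)],
    "been": [("p2", 0), ("p1", L)],
    "seeing": [("p1", 0), ("o", L)],
    "her": [("o", 0)],
}
order = {("s1", "s")}  # s1 <= s

sentence = "You have been seeing her".split()
atoms = [atom for word in sentence for atom in lexicon[word]]
results = reduces_to_primitive(atoms)
accepted = any(q == "s" or (q, "s") in order for q in results)
```

On this sentence the only primitive reached is s1, and the sentence is accepted because s1 is below s in the order.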

2.2 Parsing 

Pregroup languages are context-free languages and their parsing is polynomial.
We present in this section a parsing algorithm working directly on lists of words.
For that, we first extend the notion of inference to lists of categories, so as to
reflect the separations between the words of the initial string. The relations noted
$\Gamma \vdash_{\mathcal{R}} \Delta$, where $\mathcal{R}$ consists of one or several rules, are defined on lists of categories
(p, q are atomic, X, Y range over categories and $\Gamma$, $\Delta$ over lists of categories):

M (merge): $\Gamma, X, Y, \Delta \vdash_M \Gamma, XY, \Delta$.

I (internal): $\Gamma, X p^{(n)} q^{(n+1)} Y, \Delta \vdash_I \Gamma, XY, \Delta$, if $q \le p$ and n is even or if
$p \le q$ and n is odd.

E (external): $\Gamma, X p^{(n)}, q^{(n+1)} Y, \Delta \vdash_E \Gamma, X, Y, \Delta$, if $q \le p$ and n is even or if
$p \le q$ and n is odd.

$\vdash_{\mathcal{R}}^*$ is the reflexive-transitive closure of $\vdash_{\mathcal{R}}$. This system is equivalent to the
deduction system when the final right element is a primitive category. As a
consequence, parsing can be done using $\vdash_{MIE}^*$.

Lemma 1. For $X \in Cat_{(P,\le)}$ and $p \in P$, $X \le p$ iff $\exists q \in P$ such that $X \vdash_{MIE}^* q$
and $q \le p$.

Corollary 1. $G = (\Sigma, I, s)$ generates a string $v_1 \cdots v_n$ iff for $1 \le i \le n$,
$\exists X_i \in I(v_i)$ and $\exists p \in P$ such that $X_1, \ldots, X_n \vdash_{MIE}^* p$ and $p \le s$.

All $\vdash_I$ can be performed before $\vdash_M^*$ and $\vdash_E^*$ as the next lemma shows.

Lemma 2. (easy) $\Gamma_1 \vdash_{MIE}^* \Gamma_2$ iff $\exists \Delta$ such that $\Gamma_1 \vdash_I^* \Delta$ and $\Delta \vdash_{ME}^* \Gamma_2$.

The external reductions corresponding to the same couple and a merge reduction
can be joined together such that, at each step, the number of categories decreases.

E+ (external+merge): For $k \in \mathbb{N}$,

$\Gamma, X p_1^{(n_1)} \cdots p_k^{(n_k)}, q_k^{(n_k+1)} \cdots q_1^{(n_1+1)} Y, \Delta \vdash_{E+} \Gamma, XY, \Delta$, if $q_i \le p_i$ and $n_i$ is
even or if $p_i \le q_i$ and $n_i$ is odd, for $1 \le i \le k$.

Lemma 3. For a list of categories $\Gamma$ and $p \in P$, $\Gamma \vdash_{ME}^* p$ iff $\Gamma \vdash_{E+}^* p$.

To define a polynomial algorithm, we finally constrain the application of $\vdash_{E+}^*$
such that the width of the resulting category is never greater than the maximal
width of the two initial categories: one category plays the role of an argument
and the other plays the role of a functor even if the application is partial. The
rule is thus called Functional. In fact, there is a small difference between left
and right functional reductions (see the two different conditions wd(X) < k
or wd(Y) < k) to avoid some redundancies. The last condition wd(Y) = 0 is
necessary when $\Delta$ is empty and k = 0 to mimic a (degenerated) merge reduction.

F (functional): For $k \in \mathbb{N}$,

$\Gamma, X p_1^{(n_1)} \cdots p_k^{(n_k)}, q_k^{(n_k+1)} \cdots q_1^{(n_1+1)} Y, \Delta \vdash_F \Gamma, XY, \Delta$, if $q_i \le p_i$ and $n_i$ is
even or if $p_i \le q_i$ and $n_i$ is odd, for $1 \le i \le k$ and if wd(X) < k or
wd(Y) < k or wd(Y) = 0.

Lemma 4. For a list of categories r and p £ P, T \-* E+ p iff T \-* F p. 

Proof. The proof is based on the fact that for any planar graph where the vertices
are put on a line and where the edges are only on one side of this line, there
always exists at least one vertex that is connected only to one of its neighbours
or to both of them but not to any other vertex. This vertex is then associated
with its neighbour if it is connected to only one neighbour. If it is connected to its
two neighbours, we choose the one that interacts the most with the vertex.

The parsing of a string with n words consists in the following steps: 

1. Search for the categories associated to the n words through the lexicon. 

2. Add the categories deduced with $\vdash_I^*$.

3. Compute recursively the possible categories associated to a contiguous segment
of words of the string with $\vdash_F$.

The third step uses a function that takes the positions of the first and last words 
in the segment as parameters. The result is a set of categories with a bounded 
width (i.e. by the maximum width of the categories in the lexicon). 

Property 1 For a given grammar, this algorithm is polynomial (wrt. the num- 
ber of words of input strings). 

Example 2. Parsing of “whom have you seen?”. The categories are as follows in
the lexicon ($q' \leq s$):

whom: $q' o^{ll} q^l$     have: $q p_2^l \pi_2^l$     you: $\pi_2$     seen: $p_2 o^l$

           ...whom               ...have                 ...you          ...seen
seen...                                                                  {$p_2 o^l$}
you...                                                   {$\pi_2$}       $\emptyset$
have...                          {$q p_2^l \pi_2^l$}     {$q p_2^l$}     {$q o^l$}
whom...    {$q' o^{ll} q^l$}     $\emptyset$             $\emptyset$     {$q'$}



The cell of line i (numbered from the bottom) and column j contains the category 
computed for the fragment starting at the i th word and ending at the j th word. 
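
The acceptance of Example 2 can be checked with a small sketch. It is not the tabular $\vdash_F$ algorithm described above: it simply concatenates the categories of the words and searches exhaustively over contractions. It also assumes, for simplicity, that only identical primitives contract, so the order on primitives is used only for the final comparison with $s$. The encoding of a simple category as a pair (name, exponent) and all function names are illustrative choices.

```python
from functools import lru_cache
from itertools import product

# A simple category is (primitive, exponent): 0 is a bare primitive,
# -1 a left adjoint (l), -2 a double left adjoint (ll), +1 a right adjoint (r).

def contractions(types):
    """Yield each tuple obtained by erasing one adjacent pair a^(n) a^(n+1)."""
    for i in range(len(types) - 1):
        (a, n), (b, m) = types[i], types[i + 1]
        if a == b and m == n + 1:  # assumption: identical primitives only
            yield types[:i] + types[i + 2:]

@lru_cache(maxsize=None)
def normal_forms(types):
    """Return every irreducible tuple reachable from `types` by contractions."""
    reduced = [normal_forms(t) for t in contractions(types)]
    if not reduced:
        return frozenset([types])
    return frozenset().union(*reduced)

def generates(lexicon, words, s, leq):
    """True if some choice of categories reduces to a primitive p with p <= s."""
    for choice in product(*(lexicon[w] for w in words)):
        flat = tuple(t for category in choice for t in category)
        for form in normal_forms(flat):
            if len(form) == 1 and form[0][1] == 0 and leq(form[0][0], s):
                return True
    return False

# Lexicon of Example 2 (one category per word).
LEXICON = {
    "whom": [(("q'", 0), ("o", -2), ("q", -1))],
    "have": [(("q", 0), ("p2", -1), ("pi2", -1))],
    "you": [(("pi2", 0),)],
    "seen": [(("p2", 0), ("o", -1))],
}

def leq(a, b):
    """Order on primitives used in Example 2: equality plus q' <= s."""
    return a == b or (a, b) == ("q'", "s")

accepted = generates(LEXICON, ["whom", "have", "you", "seen"], "s", leq)
```

The exhaustive search is exponential in the worst case; it is only meant to confirm the content of the table, where the cell for the whole string contains $q'$ and $q' \leq s$.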

3 Learning 

3.1 Background 

We now recall some useful definitions and known properties on learning in the
limit [5]. Let $\mathcal{G}$ be a class of grammars that we wish to learn from positive
examples. Formally, let $\mathcal{L}(G)$ denote the language associated with a grammar $G$,
and let $V$ be a given alphabet. A learning algorithm is a function $\phi$ from finite sets
of words in $V^*$ to $\mathcal{G}$, such that $\forall G \in \mathcal{G}$, $\forall (e_i)_{i \in \mathbb{N}}$ such that $\mathcal{L}(G) = (e_i)_{i \in \mathbb{N}}$, $\exists G' \in \mathcal{G}$
and $\exists n_0 \in \mathbb{N}$ such that $\forall n > n_0$, $\phi(\{e_1, \dots, e_n\}) = G' \in \mathcal{G}$ and $\mathcal{L}(G') = \mathcal{L}(G)$.

Limit Points. A class $\mathcal{CL}$ of languages has a limit point iff there exists an infinite
sequence $\langle L_n \rangle_{n \in \mathbb{N}}$ of languages in $\mathcal{CL}$ and a language $L \in \mathcal{CL}$ such that:

$L_0 \subsetneq L_1 \subsetneq \dots \subsetneq L_n \subsetneq \dots$ and $L = \bigcup_{n \in \mathbb{N}} L_n$ ($L$ is a limit point of $\mathcal{CL}$).

If the languages of the grammars in a class $\mathcal{G}$ have a limit point then the class
$\mathcal{G}$ is unlearnable in Gold’s model.

Elasticity. A class $\mathcal{CL}$ of languages has infinite elasticity iff there exists $(e_i)_{i \in \mathbb{N}}$
a sequence of sentences and $(L_i)_{i \in \mathbb{N}}$ a sequence of languages in $\mathcal{CL}$ such that:

$\forall i \in \mathbb{N}$: $e_i \notin L_i$ and $\{e_1, \dots, e_i\} \subseteq L_{i+1}$.

It has finite elasticity in the opposite case. If $\mathcal{CL}$ has finite elasticity then the
corresponding class of grammars is learnable in Gold’s model.

3.2 Non-learnability from Strings: A Review

The class of rigid (also k-valued for any k) PGs has been shown not to be learnable
from strings in [11] using [12]. So, no learning algorithm is possible. This has
also been shown for subclasses of rigid PGs as summarized below (from [6]).

Pregroups of Order n and of Order n+1/2. A PG on $(P, \leq)$ is of order $n \in \mathbb{N}$
when its primitive categories are in $\{a^{(i)} \mid a \in P, -n \leq i \leq n\}$; it is of order
$n + 1/2$, $n \in \mathbb{N}$, when its primitive categories are in $\{a^{(i)} \mid a \in P, -n-1 \leq i \leq n\}$.

Construction of Rigid Limit Points. We have proved [6] that the smallest
such class (except order 0) has a limit point. Let $P = \{p, q, r, s\}$ and $\Sigma =
\{a, b, c, d, e\}$. We consider grammars on $(P, =)$:

$G_n = (\Sigma, I_n, s)$              $G_* = (\Sigma, I_*, s)$

$a \mapsto (p^l)^n q^l$               $a \mapsto q^l$
$b \mapsto q p q^l$                   $b \mapsto q p^l q^l$
$c \mapsto q r^l$                     $c \mapsto q r^l$
$d \mapsto r p^l r^l$                 $d \mapsto r p r^l$
$e \mapsto r p^n s$                   $e \mapsto r s$



Theorem 1 The language of $G_*$ is a limit point for the languages of grammars
$G_n$ on $(P, =)$ in the class of languages of rigid simple free PGs of order 1/2:
for $n \geq 0$, $L(G_n) = \{a b^k c d^k e \mid 0 \leq k \leq n\}$ and $L(G_*) = \{a b^k c d^k e \mid k \geq 0\}$.
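
The chain of languages in Theorem 1 can be illustrated at the level of strings. The sketch below does not use the grammars themselves: it only builds the finite sets $\{a b^k c d^k e \mid 0 \leq k \leq n\}$ and checks that they form a strictly increasing chain inside $\{a b^k c d^k e \mid k \geq 0\}$. Function names are illustrative.

```python
import re

def language_G(n):
    """Finite language { a b^k c d^k e | 0 <= k <= n } stated for G_n."""
    return {"a" + "b" * k + "c" + "d" * k + "e" for k in range(n + 1)}

def in_language_G_star(word):
    """Membership test for { a b^k c d^k e | k >= 0 } stated for G_*."""
    m = re.fullmatch(r"a(b*)c(d*)e", word)
    return m is not None and len(m.group(1)) == len(m.group(2))

chain = [language_G(n) for n in range(6)]
# On Python sets, "<" is the proper-subset test.
strictly_increasing = all(chain[i] < chain[i + 1] for i in range(5))
all_inside_limit = all(in_language_G_star(w) for lang in chain for w in lang)
```

Any finite sample of the limit language is contained in some $L(G_n)$, which is why a learner cannot safely commit to $G_*$ from positive examples alone.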

Corollary 2. The classes $\mathcal{CG}^k_{n/2}$ of k-valued simple free pregroups of order $n/2$,
$n > 0$, are not learnable from strings.

3.3 Learnability for Restricted Categories 

We consider three cases of restricted categories. Case (ii) is used in the next section.

(i) Width and Order Bounds. Here, by the order of a category $C =
p_1^{(u_1)} \cdots p_n^{(u_n)}$ we mean its integer order: $\max\{|u_i| : 1 \leq i \leq n\}$.
It is first to be noted that when we bound the width and the order of categories,
as well as the number of categories per word (k-valued), the class is learnable
from strings (since we have a finite number of grammars, up to renaming).

(ii) Width Bounded Categories. By normalizing with translations, we get:

Theorem 2 The class of rigid (also k-valued for each k) PGs with categories
of width less than N is learnable from strings for each N.

Proof. Let G denote a rigid PG on $\Sigma$ and n be the maximum width of G; we can
show that G is equivalent (same language) to a similar PG of order $\leq 2n|\Sigma|$. This
PG is a normalized version of G obtained by repeating translations as follows:
consider possible iterations of r; if two consecutive exponents never appear in
any iterated adjoints (a hole), decrease all above exponents; proceed similarly
for iterations of l. Therefore, a bounded width induces a bounded order for rigid
PGs; we then apply the learnability result for a class with a width bound and an
order bound. In the k-valued case, we proceed similarly with an order $\leq 2n|\Sigma|k$.



(iii) Pattern of Category. We infer from known relationships between cate- 
gorial formalisms, a case of learnability from strings for PGs. We refer to [13, 
14] for definitions and details on formalisms. 

From Lambek Calculus to Pregroup. We have a transformation $A \to [A]$
on formulas and sequents from $L_\emptyset$ (Lambek calculus allowing empty sequents) to
the simple free pregroup, that translates a valid sequent into a valid inequality$^4$:

$[A] = A$ when $A$ is primitive
$[A \backslash B] = [A]^r [B]$ ; $[B / A] = [B] [A]^l$
$[A_1, \dots, A_n \vdash B] = [A_1] \cdots [A_n] \leq [B]$

The order of a category $o(A)$ for Categorial Grammars is:

$o(A) = 0$ when $A$ is primitive; $o(A \backslash B) = o(B / A) = \max(o(A) + 1, o(B))$

Lemma 5. [15] If $B$ is primitive and $o(A_i) \leq 1$ for $1 \leq i \leq n$ then:

$A_1, \dots, A_n \vdash_{AB} B$ (Classical or AB Categorial Grammars)

iff $A_1, \dots, A_n \vdash_{L_\emptyset} B$ (Lambek)

iff $[A_1] \cdots [A_n] \leq B$ (simple free pregroup)
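
The translation $[\cdot]$ and the order $o(\cdot)$ can be written down directly from their definitions. The sketch below assumes an ad hoc encoding that is not part of the paper: a primitive is a string, `("\\", A, B)` stands for $A \backslash B$, `("/", B, A)` stands for $B / A$, and a pregroup category is a tuple of (primitive, exponent) pairs where exponent $+1$ is $r$ and $-1$ is $l$. It uses the standard fact that the adjoint of a concatenation reverses the order of its factors.

```python
def order(formula):
    """Order o(A): 0 for primitives, max(o(A) + 1, o(B)) for A\\B and B/A."""
    if isinstance(formula, str):
        return 0
    op, x, y = formula
    if op == "\\":          # x\y: x is the argument, y the result
        return max(order(x) + 1, order(y))
    return max(order(y) + 1, order(x))  # x/y: y is the argument, x the result

def adjoint(types, shift):
    """Adjoint of a concatenation: reverse it and shift every exponent."""
    return tuple((p, n + shift) for p, n in reversed(types))

def translate(formula):
    """Translation [A] of a Lambek formula into a pregroup category."""
    if isinstance(formula, str):
        return ((formula, 0),)
    op, x, y = formula
    if op == "\\":          # [A\B] = [A]^r [B]
        return adjoint(translate(x), +1) + translate(y)
    return translate(x) + adjoint(translate(y), -1)  # [B/A] = [B][A]^l

transitive_verb = ("/", ("\\", "np", "s"), "np")        # (np\s)/np
nested = ("/", ("/", "p", ("/", ("/", "p", "p"), "p")), "p")
```

Here `transitive_verb` has order 1 and translates to $np^r\, s\, np^l$, while `nested` is the second formula of footnote 4: it has order 2 and translates to $p\, p^{ll} p^{ll} p^l p^l$.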

We infer the following result:

Theorem 3 The class C'f of k-valued (simple free) PGs with categories of the
following pattern ($P_1$): $g_n^r \cdots g_1^r\, p\, d_1^l \cdots d_m^l$, where $n \geq 0$ and $m \geq 0$, is learnable
from strings in Gold’s model.

Proof. The class of k-valued AB grammars is learnable from strings [4]. Lemma 5
shows that the class of PGs that are images of k-valued Lambek Grammars of
order 1 (also k-valued AB-grammars with the same language) is also learnable.
And when $o(A) \leq 1$, then $[A]$ must be written as: $g_n^r \cdots g_1^r\, p\, d_1^l \cdots d_m^l$.

Relevance of Pattern $P_1$. We have observed that many linguistic examples
follow the pattern $P_1$ or are images of these by some increasing function,
i.e. a function $h$ such that $X \leq h(X)$ (for example type-raised introduction
$h_{raise}(X) = s s^l X$); moreover if $G$ assigns $h_i(t_i)$ to $c_i$, where all $h_i$ are increasing
and all $t_i$ have the pattern $P_1$, we consider $G_{P_1}$ assigning $t_i$ to $c_i$ and get:
$L(G) \subseteq L(G_{P_1})$ and the class of $G_{P_1}$ is learnable from strings.

4 The converse is not true:
$[(a \bullet b) / c] = a b c^l = [a \bullet (b / c)]$ but $(a \bullet b) / c \nvdash a \bullet (b / c)$
$[(p / ((p / p) / p)) / p] = p\, p^{ll} p^{ll} p^l p^l \leq [p]$ but $(p / ((p / p) / p)) / p \nvdash p$



3.4 Learning Pregroup Grammars from Feature-Tagged Examples 

Previous learnability results lead to intractable algorithms. But an idea from
Categorial Grammar learning is worth applying to PGs: the learnability
from Typed Examples. Types are to be understood here in the sense Montague’s 
logic gave them. Under some conditions specifying the link between categories 
and types, interesting subclasses of AB-Categorial Grammars and of Lambek 
Grammars have been proved learnable from Typed Examples, i.e. from sentences 
where each word is associated with its semantic type [8,9]. 

To adapt this idea to PGs, the first problem is that the link between PGs 
and semantics is not clearly stated. So, the notion of semantic types has no 
obvious relevance in this context and our first task is to identify what can play 
the role of language-independent features in PGs. We call Feature-tagged Exam- 
ples the resulting input data. We then define a subclass of PGs learnable from 
Feature-tagged Examples in the sense of Gold. Finally, we present an algorithm 
whose purpose is to identify every possible PG of this class compatible with a 
set of Feature-tagged Examples. An original point is that this set will be spec- 
ified by a set of constraints. We provide examples showing that this set can be 
exponentially smaller than the set of grammars it specifies. 



Specification of Input Data. Let us consider how the various possible word 
orders for a basic sentence expressing a statement at the present tense, with a 
third person subject S, a transitive verb V and a direct object O would be treated 
by various PGs (Figure 2):

order    first word          second word          third word
S V O    $\pi_3$             $\pi_3^r s_1 o^l$    $o$
O V S    $o$                 $o^r s_1 \pi_3^l$    $\pi_3$
S O V    $\pi_3$             $o$                  $o^r \pi_3^r s_1$
V O S    $s_1 \pi_3^l o^l$   $o$                  $\pi_3$
O S V    $o$                 $\pi_3$              $\pi_3^r o^r s_1$
V S O    $s_1 o^l \pi_3^l$   $\pi_3$              $o$

Fig. 2. Pregroup Grammars and possible word orders

The common points between every possible analysis are the primitive categories
associated with S and O. The category of V is always a concatenation (in various
orders) of the elements of the set $\{s_1, \pi_3^u, o^v\}$ where $u$ and $v$ are either $r$ or $l$:
this set simply expresses that V expects a subject and an object. But the nature
of the exponent ($r$ or $l$ or a combination of them) and their relative positions in
the category associated with V are language-specific.

This comparison suggests that multisets of primitive categories play the role
of language-independent features in PGs. For any set $(P, \leq)$, we call $M(P)$ the
set of multisets of elements of $P$ and $f_P$ the mapping from $Cat_{(P,\leq)}$ to $M(P)$
that transforms any category into the multiset of its primitive categories.

Definition 4. For any PG $G = (\Sigma, I, s)$, the Feature-tagged Language of $G$,
noted $FT(G)$, is defined by: $FT(G) = \{(v_1, T_1) \dots (v_n, T_n) \mid \forall i \in \{1, \dots, n\}\ \exists X_i \in
I(v_i)$ such that $X_1 \dots X_n \leq s$ and $T_i = f_P(X_i)\}$

Example 3. Let $P = \{\pi_3, o, s, s_1\}$ with $s_1 \leq s$, $\Sigma = \{he, loves, her\}$ and let
$G = (\Sigma, I, s)$ with $I(he) = \{\pi_3\}$, $I(loves) = \{\pi_3^r s_1 o^l\}$, $I(her) = \{o\}$. We
have: $(he, \{\pi_3\})(loves, \{s_1, \pi_3, o\})(her, \{o\}) \in FT(G)$. An element of $FT(G)$ is a
Feature-tagged Example. We study how PGs can be learned from such examples.
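
The mapping $f_P$ simply forgets exponents and positions. A minimal sketch, assuming categories are encoded as tuples of (primitive, exponent) pairs and multisets as `collections.Counter` objects (both encodings are illustrative choices):

```python
from collections import Counter

def f_P(category):
    """Multiset of the primitive categories occurring in a category."""
    return Counter(primitive for primitive, _exponent in category)

# Lexicon of Example 3: exponent +1 stands for r, -1 for l, 0 for none.
LEXICON = {
    "he": (("pi3", 0),),
    "loves": (("pi3", 1), ("s1", 0), ("o", -1)),
    "her": (("o", 0),),
}

# Feature-tagged Example for "he loves her".
tagged = [(word, f_P(LEXICON[word])) for word in ("he", "loves", "her")]
```

Any reordering of the three simple categories of "loves", with any exponents, yields the same tag, which is the sense in which the tags are language-independent.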

Definition 5. For any sets $\Sigma$ and $P$, we call Qf the set of PGs $G = (\Sigma, I, s)$
satisfying: $\forall v \in \Sigma$, $\forall X_1, X_2 \in I(v)$: $f_P(X_1) = f_P(X_2) \Rightarrow X_1 = X_2$

Theorem 4. The class Qf is learnable in Gold’s model from Feature-tagged Examples
(i.e. where, in Gold’s model, $FT$ plays the role of $\mathcal{L}$ and $V = \Sigma \times M(P)$).

Proof. The theorem is a corollary of Theorem 2, where k and N can be computed
from any sequence of Feature-tagged Examples that enumerates $FT(G)$:

- the condition satisfied by a PG for being an element of Qf implies that the 
number of distinct multisets associated with the same word in Feature-tagged 
Examples is the same as the number of distinct categories associated to it by 
function I. So k can be easily obtained. 

- the width of a category is exactly the number of elements in the corresponding 
multiset, so N can also be easily obtained. 
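
The two observations of the proof translate into a few lines of code. The sketch below assumes that a Feature-tagged Example is given as a list of (word, list of primitive names) pairs; it returns the number k of distinct multisets seen for the most ambiguous word and the largest multiset size seen so far, from which a width bound N can be chosen. The second sample sentence in the test is hypothetical data, used only to show k growing.

```python
def bounds_from_examples(examples):
    """Return (k, max_width) observed in a collection of Feature-tagged Examples."""
    tags_per_word = {}
    for sentence in examples:
        for word, primitives in sentence:
            # A sorted tuple is a canonical, hashable form of the multiset.
            tags_per_word.setdefault(word, set()).add(tuple(sorted(primitives)))
    k = max(len(tags) for tags in tags_per_word.values())
    max_width = max(len(tag) for tags in tags_per_word.values() for tag in tags)
    return k, max_width

EXAMPLES = [[("he", ["pi3"]), ("loves", ["s1", "pi3", "o"]), ("her", ["o"])]]
k, max_width = bounds_from_examples(EXAMPLES)
```

Both values can only grow as more examples are read, and they stabilise once every (word, multiset) pair of the target grammar has been observed.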

Acquisition Algorithm. Our algorithm takes as input a set of Feature-tagged
Examples for some $G$ in Qf and provides a set of PGs. We conjecture (although
we have not proved it yet) that the output is exactly, up to basic transformations,
the set of all PGs compatible with the input. The algorithm has two steps:
first variables are introduced, then constraints are deduced on their values.

First Step: Variable Introduction. Although Feature-tagged Examples provide
a lot of information, two things remain to be learned: the nature of the potential
exponents of categories and their relative positions inside a concatenation. We
introduce variables to code both problems. Variables for the exponents take their
value in $\mathbb{Z}$, those for the relative positions take their value in $\mathbb{N} \setminus \{0\}$.

Example 4. The Feature-tagged Example of Example 3 gives:
he: $T_1 = \{(\pi_3^{u}, x_{11})\}$
loves: $T_2 = \{(s_1^{v}, x_{21}), (\pi_3^{v'}, x_{22}), (o^{v''}, x_{23})\}$
her: $T_3 = \{(o^{w}, x_{31})\}$

with $u, v, v', v'', w \in \mathbb{Z}$. $\forall i, j$, $x_{ij} \in \mathbb{N} \setminus \{0\}$ is the position of the j-th primitive
category of the i-th word. The following constraints and consequences are available:
$\{x_{11}\} = \{1\} \Rightarrow x_{11} = 1$; $\{x_{21}, x_{22}, x_{23}\} = \{1, 2, 3\}$; $\{x_{31}\} = \{1\} \Rightarrow x_{31} = 1$

This coding allows us to reformulate the learning problem as a variable
assignment problem. Furthermore, as the Feature-tagged Examples belong to the
same $FT(G)$ for some $G$ in Qf, the same variables are used for every occurrence
of the same couple (word, multiset) in the set of Feature-tagged Examples.

Second Step: Constraints Deduction. This step consists in deducing constraints
applying on the variables. The Examples are treated one after the other.
For a given Example, we call $T_i$ the multiset associated with the i-th word. Each
initial sentence of n words is then replaced by a sequence of n multisets. Constraint
deduction takes the form of rules that mimic the rules I and F used for
the parsing of PGs in section 2.2. Constraints coming from the same syntactic
analysis are linked by a conjunction, constraints from distinct alternative syntactic
analyses are linked by a disjunction. For each sentence, we thus obtain
a disjunction of conjunctions of basic constraints (that we call data constraint)
where each basic constraint consists in an exponent part and a position part.

Let $T_m = \{(p_{mi}^{u_{mi}}, x_{mi})\}_{1 \leq i \leq k}$ and $T_{m'} = \{(p_{m'j}^{u_{m'j}}, x_{m'j})\}_{1 \leq j \leq k'}$ be two consecutive
sets (at the beginning: $m' = m+1$). If $\exists (p_{mi_0}^{u}, x_{mi_0}) \in T_m$ and $\exists (p_{m'j_0}^{u'}, x_{m'j_0})
\in T_{m'}$ such that $p_{mi_0} \leq p_{m'j_0}$ or $p_{m'j_0} \leq p_{mi_0}$ then:

- Position constraints:

$\forall i \neq i_0$, $x_{mi} \in T_m$: $x_{mi_0} > x_{mi}$
$\forall j \neq j_0$, $x_{m'j} \in T_{m'}$: $x_{m'j_0} < x_{m'j}$

- Exponent constraints:

• (all cases) $u' = u + 1$

• IF $p_{m'j_0} \leq p_{mi_0}$ THEN: $u$ is odd

• IF $p_{mi_0} \leq p_{m'j_0}$ THEN: $u$ is even

- Next sets:

• $T_m \leftarrow T_m - \{(p_{mi_0}^{u}, x_{mi_0})\}$

• $T_{m'} \leftarrow T_{m'} - \{(p_{m'j_0}^{u'}, x_{m'j_0})\}$

For internal reductions, where $m = m'$, the Position constraint is replaced
by: $\forall i \neq i_0, i \neq j_0$: $x_{mi} < x_{mi_0}$ or $x_{mj_0} < x_{mi}$.

Whenever a set $T_i$ becomes empty, drop it. The process ends when the list
gets reduced to some $\{(p^u, x)\}$ where $p \leq s$ (the constraint $u = 0$ is deduced).

If a primitive category satisfying the precondition of the rules has several 
occurrences in a multiset, any of them can be chosen (they are interchangeable). 
By convention, take the one associated with the position variable of smallest 
index. Example 6 (further) illustrates this case. To efficiently implement these 
rules, an interesting strategy consists in following the parsing steps of section 2.2. 

Example 5. Let us see what this algorithm gives on our basic Example 4 where
the initial sequence of multisets is $T_1 T_2 T_3$:

- $(\pi_3^{u}, x_{11}) \in T_1$ and $(\pi_3^{v'}, x_{22}) \in T_2$ satisfy the precondition. The position
constraints obtained are: $x_{22} < x_{21}$ and $x_{22} < x_{23}$. The exponent constraint
is: $v' = u + 1$, and the remaining sets are the following: $T_1 = \emptyset$, $T_2 =
\{(s_1^{v}, x_{21}), (o^{v''}, x_{23})\}$, $T_3 = \{(o^{w}, x_{31})\}$

- then $(o^{v''}, x_{23}) \in T_2$ and $(o^{w}, x_{31}) \in T_3$ satisfy the precondition. We deduce:
$x_{23} > x_{21}$, $w = v'' + 1$ and $T_2 = \{(s_1^{v}, x_{21})\}$, $T_3 = \emptyset$. As $s_1 \leq s$, we obtain
$v = 0$ and the algorithm stops.

From the constraints: $\{x_{21}, x_{22}, x_{23}\} = \{1, 2, 3\}$, $x_{22} < x_{21}$, $x_{22} < x_{23}$ and
$x_{23} > x_{21}$ we deduce: $x_{21} = 2$, $x_{22} = 1$ and $x_{23} = 3$. The PGs specified by these
constraints are defined up to a translation on the exponents. If we set $u = 0 = w$,
then $v' = 1$ (or $v' = r$) and $v'' = -1$ (or $v'' = l$): the only remaining PG
associates $\pi_3^r s_1 o^l$ with “loves”. But, in general, solution PGs are only specified
by a set of constraints. We will see that this set can be exponentially smaller
than the set of classes (up to translations) of PGs it specifies.
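
The constraints of Example 5 are small enough to be solved by brute force. The sketch below is not the constraint mechanism of the paper: it enumerates the permutations of $\{1, 2, 3\}$ for $(x_{21}, x_{22}, x_{23})$, keeps those satisfying the three position constraints, and then rebuilds the category of “loves” from the exponent constraints $v' = u + 1$, $w = v'' + 1$ and $v = 0$ with the choice $u = w = 0$. Names are illustrative.

```python
from itertools import permutations

def solve_positions():
    """All assignments of {1, 2, 3} to (x21, x22, x23) meeting the constraints."""
    return [
        {"x21": x21, "x22": x22, "x23": x23}
        for x21, x22, x23 in permutations((1, 2, 3))
        if x22 < x21 and x22 < x23 and x23 > x21
    ]

def category_for_loves(positions, u=0, w=0):
    """Category of 'loves' as (primitive, exponent) pairs in position order."""
    # v = 0 for s1, v' = u + 1 for pi3, v'' = w - 1 for o.
    exponents = {"s1": 0, "pi3": u + 1, "o": w - 1}
    slots = {
        positions["x21"]: "s1",
        positions["x22"]: "pi3",
        positions["x23"]: "o",
    }
    return [(slots[i], exponents[slots[i]]) for i in sorted(slots)]

solutions = solve_positions()
```

With exponent $+1$ read as $r$ and $-1$ as $l$, the single solution corresponds to $\pi_3^r s_1 o^l$.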

In an acquisition process, each example gives rise to a new data constraint
that is conjoined to the previous ones. We get a convergence property as follows:

Property 2 Let $G$ be in Qf and $FT(G) = \{e_i\}_{i \in \mathbb{N}}$; the data constraints $\mathcal{DC}_i$
obtained from the successive $e_i$ converge$^5$: $\exists n_0 \in \mathbb{N}\ \forall n \geq n_0$: $\mathcal{DC}_{n+1} = \mathcal{DC}_n$

Proof. At some stage N of the acquisition process, the set of primitive categories 
and the widths of the categories assigned to each word become known; after this 
N, we have only a finite number of possibilities for data constraints, that must 
therefore converge. 

Even if our acquisition algorithm finds every possible PG compatible with a
set of Feature-tagged Examples, it is not enough to make it a learning algorithm
in the sense of Gold. A remaining problem is to identify a unique PG in the
limit. Inclusion tests between Feature-tagged languages may be necessary for
this purpose, and we do not even know if these tests are computable. They
can nevertheless be performed for Feature-tagged Examples of bounded length
(this is Kanazawa’s strategy for learning k-valued AB-Categorial Grammars from
strings) but, of course, make the algorithm intractable in practice.



Why Pregroup Grammars and Constraints Are Efficient. The main
weakness of known learning algorithms for Categorial Grammars is their algorithmic
complexity. The only favourable case is when rigid AB-Grammars are to
be learned from Structural Examples, but this situation is of limited interest. Our
algorithm can still sometimes lead to combinatorial explosion but seems more
tractable than previous approaches, as shown by the following two examples.

Example 6 (First exponential gain). The first gain comes from the associativity
of categories in PGs. Let a word “b” be associated with a category expecting $2n$
arguments of a category associated with a word “a”, $n$ of which are on its right
and the other $n$ on its left. The corresponding Feature-tagged Example is:

a ... a          b                          a ... a
{e} ... {e}      {s, e, ..., e}             {e} ... {e}
(n times)        (e occurring 2n times)     (n times)

5 We consider constraints written in a format without repetition.

This case is equivalent to the problem of learning AB or Lambek Categorial
Grammars from the following Typed Example [7, 9]:

a ... a          b                                                                    a ... a
e ... e          $\langle e, \langle e, \dots \langle e, t \rangle \rangle \dots \rangle$     e ... e
(n times)        (e occurring 2n times)                                               (n times)

There are $\binom{2n}{n}$ different Categorial Grammars compatible with this input. This
situation occurs with transitive verbs, whose category is $T \backslash (S / T)$ or $(T \backslash S) / T$
(both corresponding to the same type $\langle e, \langle e, t \rangle \rangle$, i.e. the example with $n = 1$).
The distinct categories assigned to “b” by each solution are deducible from one
another in the Lambek calculus. Lambek Grammars are associative, but at the
rule level, whereas PGs are associative at the category level. The only compatible
PG (up to translations) is the one assigning $e^r \cdots e^r\, s\, e^l \cdots e^l$ to b, with $e^r$
occurring n times and $e^l$ occurring n times.
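
The count $\binom{2n}{n}$ can be checked by enumeration, under the assumption (made here for illustration) that each compatible Categorial Grammar corresponds to one order in which “b” consumes its $n$ left and $n$ right arguments. The function name and the L/R encoding are hypothetical.

```python
from itertools import combinations
from math import comb

def argument_orders(n):
    """All interleavings of n left (L) and n right (R) argument slots."""
    orders = []
    for left_slots in combinations(range(2 * n), n):
        orders.append("".join("L" if i in left_slots else "R" for i in range(2 * n)))
    return orders

transitive = argument_orders(1)
```

For $n = 1$ the two orders correspond to the two categories $T \backslash (S / T)$ and $(T \backslash S) / T$ of a transitive verb, whereas the pregroup category is unique up to translations.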

Example 7 (Second exponential gain). Another exponential gain can be earned
from the reduction of the learning problem to a constraints resolution problem.
In the following example, position variables are shown under each multiset for
readability:

a                              b                                       c                                       d
{s, m, n}                      {m, n, o, p}                            {o, p, q, r}                            {q, r}
{$x_{11}, x_{12}, x_{13}$}     {$x_{21}, x_{22}, x_{23}, x_{24}$}      {$x_{31}, x_{32}, x_{33}, x_{34}$}      {$x_{41}, x_{42}$}

There are several PGs compatible with this input, all of which share the same
values for exponent variables, but differ in the way they embed the two reductions
to be applied on each distinct category (two solutions, up to translations,
are shown below). As each choice is independent, there are $2^3 = 8$ different PGs
compatible with this example but defined by a conjunction of 3 constraints (the
first one is displayed after the two solutions).

               a                b                    c                    d
solution 1:    $s\, m^l n^l$    $n\, m\, o^l p^l$    $p\, o\, q^l r^l$    $r\, q$
solution 2:    $s\, n^l m^l$    $m\, n\, p^l o^l$    $o\, p\, r^l q^l$    $q\, r$

$((x_{12} < x_{13}) \wedge (x_{22} < x_{21})) \vee ((x_{13} < x_{12}) \wedge (x_{21} < x_{22}))$
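
The figure of 8 compatible grammars can be confirmed by exhaustive enumeration. The sketch below fixes the exponents to one representative (bare primitives and left adjoints only, an assumption corresponding to the choice of a translation), tries every ordering of the simple categories inside each word, and counts the orderings whose concatenation reduces to $s$. Because only exponents $0$ and $-1$ occur, a simple stack check is enough here; it is not a general pregroup parser.

```python
from itertools import permutations, product

# One representative of the shared exponent values: 0 = bare, -1 = left adjoint.
WORDS = [
    [("s", 0), ("m", -1), ("n", -1)],              # a
    [("m", 0), ("n", 0), ("o", -1), ("p", -1)],    # b
    [("o", 0), ("p", 0), ("q", -1), ("r", -1)],    # c
    [("q", 0), ("r", 0)],                          # d
]

def reduces_to_s(types):
    """Stack check: a bare x cancels a pending x^l immediately to its left."""
    stack = []
    for primitive, exponent in types:
        if stack and exponent == 0 and stack[-1] == (primitive, -1):
            stack.pop()
        else:
            stack.append((primitive, exponent))
    return stack == [("s", 0)]

count = sum(
    reduces_to_s([t for word in choice for t in word])
    for choice in product(*(permutations(w) for w in WORDS))
)
```

The enumeration visits 6912 candidate grammars, while the constraint formulation describes the same solutions with a conjunction of three small disjunctions.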



4 Conclusion 

Pregroup Grammars appear to be an interesting compromise between simplicity
and expressivity. Their link with semantics is still an open question. As far
as learnability is concerned, very little was known until now. This paper provides
theoretical as well as practical approaches to the problem. Theoretical results
prove that learning PGs is difficult unless limitations are known. The practical
approach shows that the limitations can be weakened when rich input data is
provided. These data take the form of Feature-tagged sentences which, although
very informative, are arguably language-independent. The interest of working
with constraints is that the solution grammars are only implicitly defined by the
output. The combinatorial explosion of solution grammars is then sometimes
delayed to the constraint resolution mechanism, as displayed in the examples.
As many current learning algorithms are unification-based [4,16], the use of
constraints may also be seen as a natural generalization of such techniques. What
remains to be done is to study further the properties of our algorithm, both from
the point of view of tractability and from the point of view of formal properties,
and to exploit further the good properties of bounded width grammars.

Abstract. We propose in this paper a novel approach to the induction
of the structure of Hidden Markov Models (HMMs). The notion of
partially observable Markov models (POMMs) is introduced. POMMs 
form a particular case of HMMs where any state emits a single letter 
with probability one, but several states can emit the same letter. It is 
shown that any HMM can be represented by an equivalent POMM. The 
proposed induction algorithm aims at finding a POMM fitting a sample 
drawn from an unknown target POMM. The induced model is built to 
fit the dynamics of the target machine observed in the sample. A POMM 
is seen as a lumped process of a Markov chain and the induced POMM 
is constructed to best approximate the stationary distribution and the 
mean first passage times (MFPT) observed in the sample. The induc- 
tion relies on iterative state splitting from an initial maximum likelihood 
model. The transition probabilities of the updated model are found by 
solving an optimization problem to minimize the difference between the 
observed MFPT and their values computed in the induced model. 

Keywords: HMM topology induction, Partially observable Markov 
model, Mean first passage time, Lumped Markov process, State split- 
ting algorithm. 



1 Introduction 

Hidden Markov Models (HMMs) are widely used in many pattern recognition 
areas, including applications to speech recognition [15], biological sequence mod- 
eling [6], information extraction [7,8] and optical character recognition [11], to 
name a few. In most cases, the model structure, also referred to as topology, is 
defined according to some prior knowledge of the application domain. Automatic 
techniques for inducing the HMM topology are interesting as the structures are 
sometimes hard to define a priori or need to be tuned after some task adaptation. 
The work described here presents a new approach towards this objective. 

Probabilistic automata (PA) form an alternative representation class to 
model distributions over strings, for which several induction algorithms have 
been proposed. PA and HMMs actually form two families of equivalent models,
according to whether or not final probabilities are included. In the former case,
the models generate distributions over words of finite length, while, in the latter
case, distributions are defined over complete finite prefix-free sets [5].

The equivalences between PA and HMMs can be used to apply induction 
algorithms in either formalism to model the same classes of string distributions. 
Nevertheless, previous works with HMMs mainly concentrated either on hand-built
models (e.g. [7]) or heuristics to refine predefined structures [8]. More principled
approaches are the Bayesian merging technique due to Stolcke [18] and
the maximum likelihood state-splitting method of Ostendorf and Singer [14].
The former approach however has been applied only to small problems while the
latter is specific to the subclass of left-to-right HMMs modeling speech signals.

In contrast, PA induction techniques are often formulated in theoretical learn- 
ing frameworks. These frameworks typically include adapted versions of the PAC 
model [16], Identification with probability one [1,2] or Bayesian learning [19]. 
Other approaches use error-correcting techniques [17] or statistical tests as a 
model fit induction bias [10]. All these approaches, while being interesting, are 
still somehow limited. From the theoretical viewpoint, PAC learnability is only 
feasible for restricted subclasses of PAs (see [5], for a review). The general PA 
class is identifiable with probability one [2] but this learning framework is weaker 
than the PAC model. In particular, it guarantees asymptotic convergence to a 
target model but does not bound the overall computational complexity of the 
learning process. From a practical viewpoint, several induction algorithms have 
been applied, typically to language modeling tasks [4, 3, 19, 12]. The experiments 
reported in these works show that automatically induced PA hardly outperform 
well smoothed discrete Markov chains (MC), also known as N-grams in this con- 
text. Hence even though HMMs and PA are more powerful than simple Markov 
chains, it is still unclear whether these models should be considered when no 
strong prior knowledge can help to define their structure. 

The present contribution describes a novel approach to the structural induc- 
tion of HMMs. The general objective is to induce the structure and to estimate 
the parameters of a HMM from a sample assumed to have been drawn from an 
unknown target HMM. The goal however is not the identification of the target 
model but the induction of a model sharing with the target the main features of 
the distribution it generates. We restrict here our attention to features that can 
be deduced from the sample. These features are closely related to fundamental 
quantities of a Markov process, namely the stationary distribution and mean 
first passage times. In other words, the induced model is built to fit the dynam- 
ics of the target machine observed in the sample, not necessarily to match its 
structure. 

We show in section 2 that any HMM can be converted into an equivalent 
Partially Observable Markov Model (POMM). Any state of a POMM emits 
with probability 1 a single letter, but several states can emit the same letter. 
Several properties of standard Markov chains are reviewed in section 3. The 
relation between a POMM and a lumped process in a Markov chain is detailed 
in section 4. This relation forms the basis of the induction algorithm presented 
in section 5.

2 Hidden Markov Models and Partially Observable Markov Models

We recall in this section the classical definition of a HMM and we show that any 
HMM can be represented by an equivalent partially observable model. 

Definition 1 (HMM). A discrete Hidden Markov Model (HMM) (with state
emission) is a 5-tuple $M = \langle \Sigma, Q, A, B, \iota \rangle$ where $\Sigma$ is an alphabet, $Q$ is a set of
states, $A : Q \times Q \to [0, 1]$ is a mapping defining the probability of each transition,
$B : Q \times \Sigma \to [0, 1]$ is a mapping defining the emission probability of each letter on
each state, and $\iota : Q \to [0, 1]$ is a mapping defining the initial probability of each
state. The following stochasticity (or properness) constraints must be satisfied:

$\forall q \in Q, \sum_{q' \in Q} A(q, q') = 1$; $\forall q \in Q, \sum_{a \in \Sigma} B(q, a) = 1$; $\sum_{q \in Q} \iota(q) = 1$.

Figure 1 presents a HMM defined as follows:

$\Sigma = \{a, b\}$, $Q = \{1, 2\}$, $\iota(1) = 0.4$; $\iota(2) = 0.6$;
$A(1, 1) = 0.1$; $A(1, 2) = 0.9$; $A(2, 1) = 0.7$; $A(2, 2) = 0.3$;
$B(1, a) = 0.2$; $B(1, b) = 0.8$; $B(2, a) = 0.9$; $B(2, b) = 0.1$

Fig. 1. HMM example.

Definition 2 (HMM path). Let $M = \langle \Sigma, Q, A, B, \iota \rangle$ be a HMM. A path in
$M$ is a word defined on $Q^*$. For any path $v$, $v_i$ denotes the i-th state of $v$, and
$|v|$ denotes the path length. For any word $u \in \Sigma^*$ and any path $v \in Q^*$, the
probabilities $P_M(u, v)$ and $P_M(u)$ are defined as follows:

$$P_M(u, v) = \begin{cases} \iota(v_1) \prod_{i=1}^{l-1} [B(v_i, u_i) A(v_i, v_{i+1})]\, B(v_l, u_l) & \text{if } l = |u| = |v| > 0, \\ 1 & \text{if } |u| = |v| = 0, \\ 0 & \text{otherwise.} \end{cases}$$

$$P_M(u) = \sum_{v \in Q^*} P_M(u, v).$$

$P_M(u, v)$ is the probability of emitting word $u$ while following path $v$. $P_M(u)$ can be
interpreted as the probability of observing a finite word $u$ as part of a random
walk through the model. For instance, the probability of the word $ab$ in the
HMM of Fig. 1 is given by: $P_M(ab) = P_M(ab, 11) + P_M(ab, 12) + P_M(ab, 21) +
P_M(ab, 22) = 0.0064 + 0.0072 + 0.3024 + 0.0162 = 0.3322$.
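
The worked value $P_M(ab) = 0.3322$ can be reproduced by a direct transcription of Definition 2. The sketch below sums $P_M(u, v)$ over all paths of length $|u|$, which is exponential in $|u|$ and only meant for small checks; the dictionary encoding of $\iota$, $A$ and $B$ is an illustrative choice.

```python
from itertools import product

# HMM of Fig. 1.
IOTA = {1: 0.4, 2: 0.6}
A = {(1, 1): 0.1, (1, 2): 0.9, (2, 1): 0.7, (2, 2): 0.3}
B = {(1, "a"): 0.2, (1, "b"): 0.8, (2, "a"): 0.9, (2, "b"): 0.1}

def path_probability(u, v):
    """P_M(u, v): probability of emitting word u while following path v."""
    if len(u) != len(v):
        return 0.0
    if not u:
        return 1.0
    prob = IOTA[v[0]]
    for i in range(len(u) - 1):
        prob *= B[(v[i], u[i])] * A[(v[i], v[i + 1])]
    return prob * B[(v[-1], u[-1])]

def word_probability(u):
    """P_M(u): sum of P_M(u, v) over all paths v of length |u|."""
    return sum(path_probability(u, v) for v in product(IOTA, repeat=len(u)))

p_ab = word_probability("ab")
```

The four path terms are 0.0064, 0.0072, 0.3024 and 0.0162, as in the text.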

Definition 3 (POMM). A Partially Observable Markov Model (POMM) is a
HMM $M = \langle \Sigma, Q, A, B, \iota \rangle$ with emission probabilities satisfying: $\forall q \in Q, \exists a \in
\Sigma$ such that $B(q, a) = 1$.

In other words, any state in a POMM emits a specific letter with probability 1.
Hence we can consider that POMM states only emit a single letter. This model
is called partially observable since, in general, several distinct states can emit the
same letter. As for a HMM, the observation of a word emitted during a random
walk does not allow one to identify the states from which each letter was emitted.
However, the observations define state subsets from which each letter may have
been emitted. Theorem 1 shows that the class of POMMs is equivalent to the
class of HMMs, as any distribution generated by a HMM can be represented by
a POMM.

Theorem 1 (Equivalence between HMMs and POMMs). Let M = (E, 

Q, A, B , l) be a HMM, there exists an equivalent POMM M' — (E, Q' , A', B' , </). 

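Proof. Let $M'$ be defined as follows.

- $Q' = Q \times \Sigma$,

- $B'((q, a), x) = 1$ if $x = a$, and $0$ otherwise,

- $A'((q, a), (q', b)) = B(q, b) A(q, q')$,

- $\iota'((q, a)) = \sum_{q' \in Q} \iota(q') B(q', a) A(q', q)$.

It is easily shown that $M'$ satisfies the stochasticity constraints. Let $u = u_1 \ldots u_l$
be a word of $\Sigma^*$ and let $v = ((q_1, u_1) \ldots (q_l, u_l))$ be a path in $M'$. We have:

$$
\begin{aligned}
P_{M'}(u, v) &= \iota'((q_1, u_1)) \prod_{i=1}^{l-1} \left[ B'((q_i, u_i), u_i) A'((q_i, u_i), (q_{i+1}, u_{i+1})) \right] B'((q_l, u_l), u_l) \\
&= \sum_{q' \in Q} \iota(q') B(q', u_1) A(q', q_1) \prod_{i=1}^{l-1} \left[ B(q_i, u_{i+1}) A(q_i, q_{i+1}) \right] \\
&= \sum_{q' \in Q} P_M(u, q' q_1 \ldots q_{l-1}) A(q_{l-1}, q_l)
\end{aligned}
$$

Summing up over all possible paths of length $l = |u|$ in $M'$, we obtain:

$$
\begin{aligned}
P_{M'}(u) = \sum_{v \in Q'^l} P_{M'}(u, v) &= \sum_{v_1 \in Q^{l-1}} \sum_{q' \in Q} P_M(u, q' v_1) \sum_{q \in Q} A(q_{l-1}, q) \\
&= \sum_{v_2 \in Q^l} P_M(u, v_2) = P_M(u)
\end{aligned}
$$

where $q_{l-1}$ denotes the last state of the path $q' v_1$.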

Hence, M and M' generate the same distribution. □ 
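
The construction used in the proof can be sketched in Python. This is a minimal illustration under stated assumptions: the function names `hmm_to_pomm` and `word_probability`, the dictionary-based encoding and the two-state example HMM are all hypothetical and are not taken from the paper or from Fig. 2.

```python
from itertools import product

def word_probability(word, states, init, trans, emit):
    """Brute-force P(u): sum over all state paths of length |u|."""
    total = 0.0
    for path in product(states, repeat=len(word)):
        p = init[path[0]]
        for i in range(len(word) - 1):
            p *= emit[path[i]][word[i]] * trans[path[i]][path[i + 1]]
        total += p * emit[path[-1]][word[-1]]
    return total

def hmm_to_pomm(states, alphabet, init, trans, emit):
    """Build M' = (Sigma, Q x Sigma, A', B', iota') as in the proof of Theorem 1."""
    q2 = [(q, a) for q in states for a in alphabet]
    # iota'((q, a)) = sum_{q'} iota(q') B(q', a) A(q', q)
    init2 = {(q, a): sum(init[p] * emit[p][a] * trans[p][q] for p in states)
             for (q, a) in q2}
    # A'((q, a), (q', b)) = B(q, b) A(q, q')
    trans2 = {(q, a): {(q1, b): emit[q][b] * trans[q][q1] for (q1, b) in q2}
              for (q, a) in q2}
    # B'((q, a), x) = 1 if x == a, else 0
    emit2 = {(q, a): {x: 1.0 if x == a else 0.0 for x in alphabet}
             for (q, a) in q2}
    return q2, init2, trans2, emit2

# Hypothetical example HMM (not the one of Fig. 2).
states, alphabet = [1, 2], ["a", "b"]
init = {1: 0.4, 2: 0.6}
trans = {1: {1: 0.1, 2: 0.9}, 2: {1: 0.7, 2: 0.3}}
emit = {1: {"a": 0.2, "b": 0.8}, 2: {"a": 0.9, "b": 0.1}}

q2, init2, trans2, emit2 = hmm_to_pomm(states, alphabet, init, trans, emit)
```

Comparing `word_probability` on both models for all short words gives a quick numerical check that the two models assign the same probabilities.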

The proof of Theorem 1 is adapted from a similar result showing the equivalence
between PA without final probabilities and HMMs [5]. An immediate
corollary of this theorem is the equivalence between PA and POMMs. Hence
we call a regular string distribution any distribution generated by these models$^1$.
Figure 2 shows a HMM and its equivalent POMM.

1 More precisely, these models generate distributions over complete finite prefix-free
sets. A typical case is a distribution defined over $\Sigma^n$, for some positive integer $n$.
See [5] for further details.

Fig. 2. Transformation of a HMM into an equivalent POMM.



It should be stressed that all transition probabilities of the form $A'((q, \cdot), (q', b))$
are necessarily equal as the value of $A'((q, a), (q', b))$ does not depend on $a$ in a
POMM constructed in this way. A state $(q, a)$ in this model represents the state
$q$ reached during a random walk in the original HMM after having emitted the
letter $a$ on any state.

3 Markov Chains, Stationary Distribution 
and Mean First Passage Times 

The notion of POMM introduced in section 2 is closely related to a standard 
Markov Chain (MC). Indeed, in the particular case where all states emit a dif- 
ferent letter, the process of a POMM is fully observable. Moreover the Markov 
property is satisfied as, by definition, the probability of any transition only de- 
pends on the current state. Some fundamental properties of a Markov chain 
are recalled in this section. The links between a POMM and a MC are further 
detailed in section 4. 

Definition 4 (Discrete Time Markov Chain). A discrete time Markov
Chain (MC) is a stochastic process $\{X_t\}$ where the random variable $X$ takes
its value at any discrete time $t$ in a countable set $Q$ and such that:
$P[X_{t+1} = q \mid X_t, X_{t-1}, \ldots, X_0] = P[X_{t+1} = q \mid X_t]$.
This condition states that the probability
of the next outcome only depends on the last value of the process. This is known
as the (first-order) Markov property. When the set $Q$ is finite the process forms
a finite state MC.

Definition 5 (Finite State MC Representation). A finite state representation
of a MC is a 3-tuple $T = (Q, A, \iota)$ where $Q$ is a finite set of states,
$A : Q \times Q \to [0, 1]$ is a mapping defining the transition probability function and
$\iota : Q \to [0, 1]$ is the initial probability of each state. The following stochasticity
constraints must be satisfied: $\sum_{q \in Q} \iota(q) = 1$; $\forall q \in Q, \sum_{q' \in Q} A(q, q') = 1$.

In this context, the Markov property simply states that the probability of
reaching the next state only depends on the current state. For a finite MC, the
transition probability function can be represented as a $|Q| \times |Q|$ transition matrix.
In the sequel, $A$ denotes both this function and its matrix representation, with
$A_{qq'} = A(q, q')$. Similarly, the function $\iota$ is associated with a $|Q|$-dimensional
initial probability vector, with $\iota_q = \iota(q)$. We will use MC interchangeably to
denote a finite Markov chain or its finite state representation. A finite MC can
also be constructed from a HMM by ignoring the emission probabilities and the
alphabet. We call this model the underlying MC of a HMM.

Definition 6 (Underlying MC of a HMM). Given a HMM $M = (\Sigma, Q, A, B, \iota)$,
the underlying Markov chain $T$ is the 3-tuple $(Q, A, \iota)$.

Definition 7 (Random Walk String). Given a MC, $T = (Q, A, \iota)$, a random
walk string $s$ can be defined on $Q^*$ as follows. A random walker is positioned on
a state $q$ according to the initial distribution $\iota$. The random walker next moves
to some state $q'$ according to the probability $A(q, q')$. Repeating this operation $n$
times results in an $n$-step random walk. The string $s$ is the sequence of states
visited during this walk.

In the present work, we focus on regular Markov chains. For such chains, there 
is a strictly positive probability to be in any state after n steps, no matter the 
starting state. 

Definition 8 (Regular MC). A MC with transition matrix $A$ is regular if
and only if for some $n \in \mathbb{N}$, the power matrix $A^n$ has no zero entries.

In other words, the transition graph of a regular MC is strongly connected$^2$
and all states are aperiodic$^3$. The stationary distribution and mean first passage
times are fundamental quantities characterizing the dynamics of random walks 
in a regular MC. These quantities form the basis of the induction algorithm 
presented in section 5.2. 

Definition 9 (Stationary Distribution). Given a regular MC, $T = (Q, A, \iota)$,
the stationary distribution is a $|Q|$-dimensional stochastic vector $\pi$ such that
$\pi A = \pi$.

This vector is also known as the equilibrium vector or steady-state vector. A
regular MC is started at equilibrium when the initial distribution $\iota$ is set to the
stationary distribution $\pi$. The $q$-th entry of the vector $\pi$ can be interpreted as
the expected proportion of the time the steady-state process spends in state $q$.
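
For a regular chain the stationary vector can be approximated by repeatedly applying $\pi \leftarrow \pi A$. The sketch below is one possible way to do so in plain Python; the function name `stationary_distribution`, the tolerance and the two-state matrix are hypothetical choices made for the illustration.

```python
def stationary_distribution(A, tol=1e-12, max_iter=100_000):
    """Approximate pi with pi A = pi by power iteration (assumes a regular MC)."""
    n = len(A)
    pi = [1.0 / n] * n                      # any stochastic start vector works
    for _ in range(max_iter):
        nxt = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, pi)) < tol:
            return nxt
        pi = nxt
    raise RuntimeError("no convergence; is the chain regular?")

# Hypothetical regular two-state chain.
A = [[0.5, 0.5],
     [0.2, 0.8]]
pi = stationary_distribution(A)   # close to (2/7, 5/7)
```

For this matrix the balance equation $0.5\,\pi_1 = 0.2\,\pi_2$ together with $\pi_1 + \pi_2 = 1$ gives $\pi = (2/7, 5/7)$.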

Definition 10 (Mean First Passage Time). Given a regular MC, $T = (Q, A, \iota)$,
the first passage time is a function $f : Q \times Q \to \mathbb{N}$ such that $f(q, q')$
is the number of steps before reaching state $q'$ for the first time, leaving initially
from state $q$.

$$f(q, q') = \inf\{t \geq 1 \mid X_t = q' \text{ and } X_0 = q\}$$

The Mean First Passage Time (MFPT) denotes the expectation of this function.
It can be represented by the MFPT matrix $M$, with $M_{qq'} = E[f(q, q')]$.

2 The chain is said to be irreducible.

3 A state $i$ is aperiodic if $A^n_{ii} > 0$ for all sufficiently large $n$.
For a regular MC, the MFPT values can be obtained by solving the following
linear system [9]:

$$
\forall q, q' \in Q, \quad M_{qq'} =
\begin{cases}
1 + \sum_{q'' \neq q'} A_{qq''} M_{q''q'} & \text{if } q \neq q' \\
\dfrac{1}{\pi_q} & \text{otherwise.}
\end{cases}
$$

The values $M_{qq}$ are usually called recurrence times$^4$.
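
As an illustration, the linear system can be solved one target state at a time. The following Python sketch is an assumption-laden example rather than the authors' implementation: `solve` is a small hand-written Gauss-Jordan routine, `mfpt_matrix` is a hypothetical name, and the two-state chain is invented for the example. The diagonal is obtained from the one-step recurrence, which coincides with $1/\pi_q$ for a regular chain.

```python
def solve(M, b):
    """Solve M x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(M)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(n):
            if r != c:
                f = aug[r][c] / aug[c][c]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[c])]
    return [aug[i][n] / aug[i][i] for i in range(n)]

def mfpt_matrix(A):
    """MFPT matrix of a regular MC given its transition matrix A."""
    n = len(A)
    M = [[0.0] * n for _ in range(n)]
    for target in range(n):
        others = [q for q in range(n) if q != target]
        # (I - A restricted to the non-target states) m = 1
        system = [[(1.0 if q == r else 0.0) - A[q][r] for r in others]
                  for q in others]
        m = solve(system, [1.0] * len(others))
        for q, value in zip(others, m):
            M[q][target] = value
        # recurrence time: one step, then first passage from the state reached
        M[target][target] = 1.0 + sum(A[target][q] * M[q][target] for q in others)
    return M

# Hypothetical regular two-state chain with stationary vector (2/7, 5/7).
A = [[0.5, 0.5],
     [0.2, 0.8]]
M = mfpt_matrix(A)
```

Here `M[0][0]` and `M[1][1]` can be compared with $1/\pi_q$, that is $7/2$ and $7/5$, as a sanity check.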



4 Relation Between Partially Observable Markov Models 
and Markov Chains 



Given a MC, a partition can be defined on its state set and the resulting process 
is said to be lumped. 

Definition 11 (Lumped Process). Given a regular MC, $T = (Q, A, \iota)$, let $q^{(t)}$
be the state reached at time $t$ during a random walk in $T$. $\kappa = \{\kappa_1, \kappa_2, \ldots, \kappa_r\}$
denotes a partition of the set of states $Q$. $K_\kappa : Q \to 2^Q$ denotes a function that,
given a state $q$, returns the block of $\kappa$, or state subset, containing $q$. The lumped
process $T//\kappa$ outputs $K_\kappa(q^{(t)})$ at time $t$.

Consider for example the regular MC $T_1$ illustrated$^5$ in Fig. 3. A partition $\kappa$
is defined on its state set, with $\kappa_1 = \{1, 3\}$, $\kappa_2 = \{2\}$ and $\kappa_3 = \{4\}$. The random
walk 312443 in $T_1$ corresponds to the following observations in the lumped
process $T_1//\kappa$: $\kappa_1 \kappa_1 \kappa_2 \kappa_3 \kappa_3 \kappa_1$.
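
The mapping from a random walk to its lumped observations is a simple relabelling, as the following Python sketch shows. The function name `lump` and the block names `k1`, `k2`, `k3` are hypothetical; the partition and the walk are those of the example above.

```python
def lump(walk, partition):
    """Replace each visited state by the name of the block containing it."""
    block_of = {q: name for name, block in partition.items() for q in block}
    return [block_of[q] for q in walk]

partition = {"k1": {1, 3}, "k2": {2}, "k3": {4}}
observed = lump([3, 1, 2, 4, 4, 3], partition)
# observed == ["k1", "k1", "k2", "k3", "k3", "k1"]
```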




Fig. 3. A regular Markov chain $T_1$ and the partition $\kappa = \{\{1, 3\}, \{2\}, \{4\}\}$.



While the states are fully observable during a random walk in a MC, a lumped
process is associated with random walks where only state subsets are observed. In
this sense, the lumped process makes the MC only partially observable, as is
the case for a POMM. Conversely, a random walk in a POMM can be considered
as a lumped process of its underlying MC with respect to an observable partition
of its state set. Each block of the observable partition corresponds to the state(s)
emitting a specific letter.

4 An alternative definition, $M_{qq} = 0$, is possible when it is not required to leave the
initial state before reaching the destination state for the first time [13].

5 For the sake of clarity, the initial probability of each state is not depicted. Moreover,
as we are mostly interested in MC being in steady-state mode, the initial distribution
is assumed to be equal to the stationary distribution deriving from the transition
matrix (see Def. 9).

Definition 12 (Observable Partition). Given a POMM $M = (\Sigma, Q, A, B, \iota)$,
the observable partition $\kappa$ is defined as follows:
$\forall q, q' \in Q, K_\kappa(q) = K_\kappa(q') \Leftrightarrow \exists a \in \Sigma, B(q, a) = B(q', a) = 1$.

The underlying MC $T$ of a POMM $M$ has the same state set as $M$. Thus the
observable partition $\kappa$ of $M$ is also defined for the state set of $T$. If each block
of this partition is labeled by the associated letter, $M$ and $T//\kappa$ define the same
string distribution.

It is important to notice that the Markov property is not necessarily satisfied
for a lumped process. For example, the lumped MC in Fig. 3 satisfies
$P[X_{t+2} = \kappa_2 \mid X_{t+1} = \kappa_1, X_t = \kappa_2] = 0.2$ and
$P[X_{t+2} = \kappa_2 \mid X_{t+1} = \kappa_1, X_t = \kappa_3] = 0.4$,
which clearly violates the first-order Markov property. In general, the Markov
property is not satisfied when, for a fixed length history, it is impossible to
decide unequivocally which state the process has reached in a given block while
the next step probability differs for several states in this block. This can be
the case no matter the length of the history considered. This is illustrated by
the MC depicted in Fig. 4 and the partition $\kappa = \{\{1, 2\}, \{3\}\}$. Even if the
complete history of the lumped process is given, there is no way to know the
state reached in $\kappa_1$. Thus, the probability $P[X_t = \kappa_2 \mid X_{t-1} = \kappa_1, X_{t-2}, \ldots, X_0]$
cannot be unequivocally determined and the lumped process is not markovian
for any order. Hence the definition of lumpability.




Fig. 4. A non markovian lumped process.

Fig. 5. The MC $T_1$ lumped with respect to the partition $\kappa' = \{\{1, 2\}, \{3, 4\}\}$.

Definition 13 (Lumpability). A MC $T$ is lumpable with respect to a partition
$\kappa$ if the lumped process $T//\kappa$ satisfies the first-order Markov property for any
initial distribution.

When a MC $T$ is lumpable with respect to a partition $\kappa$, the lumped process
$T//\kappa$ itself defines a Markov chain.

Theorem 2 (Necessary and sufficient conditions for lumpability [9]).
A MC is lumpable with respect to a partition $\kappa$ if and only if for every pair of
blocks $\kappa_i$ and $\kappa_j$ the probability $A_{ij//\kappa}$ to reach some state of $\kappa_j$ is equal from
every state in $\kappa_i$:

$$\forall \kappa_i, \kappa_j \in \kappa, \forall q, q' \in \kappa_i, \quad A_{ij//\kappa} = \sum_{q'' \in \kappa_j} A_{qq''} = \sum_{q'' \in \kappa_j} A_{q'q''}$$

The values $A_{ij//\kappa}$ form the transition matrix of the lumped chain. For example,
the MC $T_1$ given in Fig. 3 is not lumpable with respect to the partition
$\kappa = \{\{1, 3\}, \{2\}, \{4\}\}$ while it is lumpable with respect to the partition
$\kappa' = \{\{1, 3\}, \{2, 4\}\}$. The lumped chain $T_1//\kappa'$ is illustrated in Fig. 5.
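
The condition of Theorem 2 translates directly into a test on the transition matrix. The sketch below is illustrative only: `is_lumpable` is a hypothetical name, states are numbered from 0, and the four-state matrix is invented for the example because the transition probabilities of $T_1$ are only given in Fig. 3.

```python
def is_lumpable(A, partition, tol=1e-9):
    """Check that, for every pair of blocks, all states of the source block
    have the same total probability of moving into the destination block."""
    for block_i in partition:
        for block_j in partition:
            sums = [sum(A[q][r] for r in block_j) for q in block_i]
            if max(sums) - min(sums) > tol:
                return False
    return True

# Hypothetical 4-state chain (states 0..3); NOT the chain T_1 of Fig. 3.
A = [[0.1, 0.3, 0.2, 0.4],
     [0.3, 0.1, 0.3, 0.3],
     [0.2, 0.5, 0.1, 0.2],
     [0.4, 0.2, 0.2, 0.2]]

coarse = [[0, 2], [1, 3]]      # lumpable: block sums agree row by row
fine = [[0, 2], [1], [3]]      # not lumpable: states 0 and 2 differ towards {1}
```

With this matrix, states 0 and 2 both move to the block {1, 3} with probability 0.7, but they reach state 1 alone with probabilities 0.3 and 0.5, so only the coarser partition passes the test.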

Even though a lumped process is not necessarily markovian, it is useful for 
the induction algorithm presented in section 5.2 to define the mean first passage 
times between the blocks of a lumped process. To do so, it is convenient to 
introduce some notions from absorbing Markov chains. In a MC, a state q is 
said to be absorbing if there is a probability 1 to go from q to itself. In other 
words, once an absorbing state has been reached in a random walk, the process 
will stay on this state forever. A MC for which there is a probability 1 to end 
up in an absorbing state is called an absorbing MC. In such a model, the state 
set can be divided into the absorbing state set $Q_A$ and its complementary set,
the transient state set $Q_T$. The transition submatrix between transient states is
denoted $A_T$. A related notion is the mean time to absorption.

Definition 14 (Mean Time to Absorption).
Given an absorbing MC, $T = (\{Q_A, Q_T\}, A, \iota)$, the time to absorption is a
function $g : Q_T \to \mathbb{N}$ such that $g(q)$ is the number of steps before absorption,
leaving initially from a transient state $q$.

$$g(q) = \inf\{t \geq 1 \mid X_t \in Q_A, X_0 = q\}$$

The Mean Time to Absorption (MTA) denotes the expectation of this function.
It can be represented by the vector $z$ computed as $z = (I - A_T)^{-1} \mathbf{1}$, where $\mathbf{1}$
denotes a $|Q_T|$-dimensional vector with each component being equal to 1.
The $q$-th entry of $z$ represents the mean time to absorption, leaving initially from
the transient state $q$.
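
A small numerical sketch of the MTA vector follows. Instead of inverting $I - A_T$ explicitly, it iterates $z \leftarrow \mathbf{1} + A_T z$, which converges to the same solution for an absorbing chain; this substitution, the name `mean_time_to_absorption` and the example submatrix are assumptions made for the illustration.

```python
def mean_time_to_absorption(A_T, tol=1e-12, max_iter=1_000_000):
    """Approximate z = (I - A_T)^{-1} 1 by iterating z <- 1 + A_T z."""
    n = len(A_T)
    z = [0.0] * n
    for _ in range(max_iter):
        nxt = [1.0 + sum(A_T[i][j] * z[j] for j in range(n)) for i in range(n)]
        if max(abs(x - y) for x, y in zip(nxt, z)) < tol:
            return nxt
        z = nxt
    raise RuntimeError("no convergence; is the chain absorbing?")

# Hypothetical transient submatrix: each row sums to less than 1,
# the missing mass being the probability of jumping to an absorbing state.
A_T = [[0.5, 0.25],
       [0.0, 0.5]]
z = mean_time_to_absorption(A_T)   # close to [3.0, 2.0]
```

By hand, $z_2 = 1 + 0.5 z_2$ gives $z_2 = 2$, and $z_1 = 1 + 0.5 z_1 + 0.25 z_2$ gives $z_1 = 3$.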



Definition 15 (MFPT for a Lumped Process). Given a regular MC $T = (Q, A, \iota)$,
$\kappa$ a partition of $Q$ and $\kappa_i, \kappa_j$ two blocks of $\kappa$, an absorbing MC $T^j$ is
created from $T$ by transforming every state of $\kappa_j$ to be absorbing. Furthermore,
let $z^j$ be the MTA vector of $T^j$. The mean first passage time $M_{ij//\kappa}$ from $\kappa_i$ to
$\kappa_j$ in the lumped process $T//\kappa$ is defined as follows:

$$
M_{ij//\kappa} =
\begin{cases}
\dfrac{1}{\pi_{\kappa_i}} & \text{if } \kappa_i = \kappa_j \\
\sum_{q \in \kappa_i} \dfrac{\pi_q}{\pi_{\kappa_i}} z^j_q & \text{otherwise}
\end{cases}
$$

where $\pi_q$ is the stationary distribution of state $q$ in $T$ and $\pi_{\kappa_i} = \sum_{q \in \kappa_i} \pi_q$ the
stationary distribution of the block $\kappa_i$ in the lumped process $T//\kappa$.
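
Definition 15 can be sketched in Python as follows. Everything here is an assumption made for illustration: the function names, the use of plain fixed-point iterations (with a fixed number of sweeps) for both the stationary vector and the MTA vector, and the four-state example chain, which is not the chain of Fig. 3.

```python
def stationary(A, iters=10_000):
    """Power iteration for pi A = pi (assumes a regular MC)."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi

def absorption_times(A, absorbing, iters=10_000):
    """z[q]: expected number of steps from q until a state of `absorbing` is hit."""
    transient = [q for q in range(len(A)) if q not in absorbing]
    z = {q: 0.0 for q in transient}
    for _ in range(iters):
        z = {q: 1.0 + sum(A[q][r] * z[r] for r in transient) for q in transient}
    return z

def lumped_mfpt(A, partition):
    """MFPT between the blocks of the lumped process (Def. 15)."""
    pi = stationary(A)
    r = len(partition)
    M = [[0.0] * r for _ in range(r)]
    for j, block_j in enumerate(partition):
        z = absorption_times(A, set(block_j))   # states of block j made absorbing
        for i, block_i in enumerate(partition):
            pi_block = sum(pi[q] for q in block_i)
            if i == j:
                M[i][j] = 1.0 / pi_block
            else:
                M[i][j] = sum(pi[q] / pi_block * z[q] for q in block_i)
    return M

# Hypothetical 4-state chain, lumpable with respect to {{0, 2}, {1, 3}}.
A = [[0.1, 0.3, 0.2, 0.4],
     [0.3, 0.1, 0.3, 0.3],
     [0.2, 0.5, 0.1, 0.2],
     [0.4, 0.2, 0.2, 0.2]]
M_blocks = lumped_mfpt(A, [[0, 2], [1, 3]])
```

For this example the lumped chain has transition matrix [[0.3, 0.7], [0.6, 0.4]], so the block MFPT values can be verified by hand: 1/0.7 from the first block to the second and 1/0.6 in the other direction.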

In a lumped process, state subsets are observed instead of the original states
of the Markov chain. A related, but possibly different, process is obtained when 
the states of the original MC are merged to form a quotient Markov chain. 
Definition 16 (Quotient MC). Given a MC $T = (Q, A, \iota)$ and a partition
$\kappa = \{\kappa_1, \kappa_2, \ldots, \kappa_r\}$ on $Q$, the quotient $T/\kappa$ is an $r$-state MC with transition
matrix $A/\kappa$ and initial vector $\iota/\kappa$ defined as follows:

$$A_{ij/\kappa} = \sum_{q \in \kappa_i} \frac{\pi_q}{\pi_{\kappa_i}} \sum_{q' \in \kappa_j} A_{qq'}, \qquad \iota_{i/\kappa} = \sum_{q \in \kappa_i} \iota_q$$

where $\pi$ is the stationary distribution of $T$ and $\pi_{\kappa_i} = \sum_{q \in \kappa_i} \pi_q$.

Note that for any regular MC $T$, the quotient $T/\kappa$ always has the Markov
property while, as mentioned before, this is not necessarily the case for the
lumped process $T//\kappa$. The following theorem specifies under which condition the
distributions generated by $T/\kappa$ and $T//\kappa$ are identical.

Theorem 3. If a MC $T$ is lumpable with respect to a partition $\kappa$ then $T/\kappa$ and
$T//\kappa$ generate the same distribution in steady-state.



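Proof. When $T$ is lumpable with respect to $\kappa$, the transition probabilities between
any pair of blocks $\kappa_i, \kappa_j$ are the same in both models:

$$A_{ij/\kappa} = \sum_{q \in \kappa_i} \frac{\pi_q}{\pi_{\kappa_i}} \sum_{q' \in \kappa_j} A_{qq'} = A_{ij//\kappa} \sum_{q \in \kappa_i} \frac{\pi_q}{\pi_{\kappa_i}} = A_{ij//\kappa}$$

□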
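
The quotient construction and the statement of Theorem 3 can be illustrated numerically. In the Python sketch below, `quotient_chain` and `stationary` are hypothetical names, the stationary vector is approximated by power iteration, and the four-state chain is an invented example that is lumpable with respect to {{0, 2}, {1, 3}}.

```python
def stationary(A, iters=10_000):
    """Power iteration for pi A = pi (assumes a regular MC)."""
    n = len(A)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * A[i][j] for i in range(n)) for j in range(n)]
    return pi

def quotient_chain(A, partition):
    """Transition matrix A/kappa of the quotient MC (Def. 16)."""
    pi = stationary(A)
    r = len(partition)
    Q = [[0.0] * r for _ in range(r)]
    for i, block_i in enumerate(partition):
        pi_block = sum(pi[q] for q in block_i)
        for j, block_j in enumerate(partition):
            Q[i][j] = sum(pi[q] / pi_block * sum(A[q][q2] for q2 in block_j)
                          for q in block_i)
    return Q

# Hypothetical 4-state chain, lumpable with respect to {{0, 2}, {1, 3}}.
A = [[0.1, 0.3, 0.2, 0.4],
     [0.3, 0.1, 0.3, 0.3],
     [0.2, 0.5, 0.1, 0.2],
     [0.4, 0.2, 0.2, 0.2]]
Q_lumpable = quotient_chain(A, [[0, 2], [1, 3]])
Q_other = quotient_chain(A, [[0, 2], [1], [3]])
```

For the lumpable partition the quotient matrix equals the lumped transition matrix [[0.3, 0.7], [0.6, 0.4]]; for the other partition the quotient is still a stochastic matrix, but it need not describe the lumped process.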



5 A Markovian Approach to the Induction 
of Regular Distributions 



As explained in section 4, a random walk in a POMM can be seen as a lumped
process of its underlying MC lumped with respect to the observable partition.
We now present an induction algorithm making use of this relation. Given a data
sample, assumed to have been drawn from a target POMM TP, our induction
algorithm estimates a model EP fitting the dynamics of the MC related to TP.
The estimation relies on the stationary distribution and the mean first passage
times which can be derived from the sample. In the present work, we focus
on distributions that can be represented by POMMs without final probabilities
and with regular underlying MC. Since the target process TP never stops, the
sample is assumed to have been observed in steady-state. Furthermore, as the
transition graph of TP is strongly connected, it is not restrictive to assume
that the data is a unique finite string $s$ resulting from a random walk through
TP observed during a finite time$^6$. Under these assumptions, all transitions of
the target POMM and all letters of its alphabet will tend to be observed in the
sample. Such a sample can be called structurally complete. The sample estimates
are detailed in section 5.1 and an algorithm for POMM induction is proposed
in section 5.2.



6 The statistics described in section 5.1 could equivalently be computed from repeated
finite samples observed in steady-state.

5.1 Sample Estimates

As the target process TP can be considered as a lumped process, each letter of
the sample $s$ is associated with a unique state subset of the observable partition $\kappa$.
All estimates introduced here are related to the state subsets of the target lumped
process. First, we introduce the stationary maximum likelihood model. This
model is the starting point of the induction algorithm presented in section 5.2.

Definition 17 (Stationary Maximum Likelihood MC). Given a string $s$
on an alphabet $\Sigma$, the stationary maximum likelihood MC $ML = (Q, \hat{A}, \hat{\iota})$ is
defined as follows: $Q = \Sigma$; $\forall a, b \in Q, \hat{A}_{ab} = \frac{\mathrm{count}(a, b)}{\mathrm{count}(a)}$; $\forall a \in Q, \hat{\iota}_a = \hat{\pi}_a$;
where $\mathrm{count}(a, b)$ is the number of times the letter $a$ is immediately followed by
the letter $b$ in $s$, $\mathrm{count}(a) = \sum_{b \in \Sigma} \mathrm{count}(a, b)$ and $\hat{\pi}$ is the stationary vector
computed from $\hat{A}$ (see Def. 9).
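
The estimate of Definition 17 amounts to counting letter bigrams. The following Python sketch is one possible reading of the definition; the name `estimate_ml`, the use of power iteration for the stationary vector and the sample string are assumptions made for the example. It raises an error when some letter is never followed by another letter, since the corresponding row would be undefined.

```python
from collections import Counter

def estimate_ml(s, iters=10_000):
    """Stationary maximum likelihood MC estimated from the string s."""
    letters = sorted(set(s))
    pairs = Counter(zip(s, s[1:]))                       # count(a, b)
    count = {a: sum(pairs[(a, b)] for b in letters) for a in letters}
    if any(c == 0 for c in count.values()):
        raise ValueError("some letter is never followed by another letter")
    A = {a: {b: pairs[(a, b)] / count[a] for b in letters} for a in letters}
    # Stationary vector by power iteration (assumes the estimated chain is regular).
    pi = {a: 1.0 / len(letters) for a in letters}
    for _ in range(iters):
        pi = {b: sum(pi[a] * A[a][b] for a in letters) for b in letters}
    return letters, A, pi

letters, A, pi = estimate_ml("abbabab")
```

In the string `abbabab` the letter a is followed by b three times and never by a, while b is followed by a twice and by b once, which gives the rows (0, 1) and (2/3, 1/3) and the stationary vector (0.4, 0.6).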

The ML model is a maximum likelihood estimate of the quotient MC TP/$\kappa$,
where $\kappa$ is the observable partition. Furthermore the stationary distribution of
TP/$\kappa$ fits the letter distribution observed in the sample. The letter distribution
is however not sufficient to reproduce the dynamics of the target machine. For
instance, if the letters of $s$ were alphabetically sorted, the stationary distribution
of the ML model would be unchanged. In order to better fit the target dynamics,
the induced model is further required to comply with the MFPT between the
blocks of TP//$\kappa$, that is between the letters observed in the sample.

Definition 18 (MFPT Matrix Estimate). Given a string $s$ defined on an
alphabet $\Sigma$, $M$ is a $|\Sigma| \times |\Sigma|$ matrix where $M_{ab}$ is the average number of symbols
after an occurrence of $a$ in $s$ to observe the first occurrence of $b$.
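
A possible way to compute this estimate is a right-to-left scan of the sample for each target letter. The sketch below is an interpretation, not the authors' code: `sample_mfpt` is a hypothetical name, occurrences of $a$ that are not followed by any later $b$ are ignored, and an entry with no usable occurrence is reported as infinity.

```python
def sample_mfpt(s):
    """Average number of symbols from an occurrence of a to the next b."""
    letters = sorted(set(s))
    M = {}
    for b in letters:
        next_b = None                      # index of the nearest b to the right
        dist = {a: [] for a in letters}
        for p in range(len(s) - 1, -1, -1):
            if next_b is not None:
                dist[s[p]].append(next_b - p)
            if s[p] == b:
                next_b = p
        for a in letters:
            M[(a, b)] = sum(dist[a]) / len(dist[a]) if dist[a] else float("inf")
    return M

M_hat = sample_mfpt("abbabab")
```

For `abbabab`, every a is immediately followed by b, so the entry for (a, b) is 1, while the three occurrences of b that precede an a are at distances 2, 1 and 1 from it, giving 4/3 for (b, a).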

5.2 Induction Algorithm 

Given a target POMM TP and a random walk string $s$ built from TP, the
objective of our induction algorithm is to construct a model fitting the stationary
distribution and the MFPT estimated from the sample. This algorithm starts
from the stationary maximum likelihood model ML, which complies with the
stationary distribution. Iterative state splitting in the current model makes it
possible to increase the fit to the MFPT, while preserving the stationary
distribution. The induction algorithm is sketched hereafter.

At each iteration, a state $q$ is selected by the function selectStateToSplit
in an arbitrary order. During the call to splitState, the state $q$ is split into
two new states $q_1$ and $q_2$ as depicted in Fig. 6. The input states $i_1, \ldots, i_k$ and
output states $o_1, \ldots, o_l$ are those directly connected to $q$ in the current model
in which all transition probabilities $A$ are known. Input and output states are
not necessarily disjoint.

The topology after splitting provides additional degrees of freedom in the
transition probabilities. The new transition probabilities $x, y, z$ form the variables
of an optimization problem, which can be represented by the matrices
$X$ ($k \times 2$), $Y$ ($2 \times l$) and $Z$ ($2 \times 2$). The objective function to be minimized is
$W(X, Y, Z) = \sum_{i,j} (M_{ij} - M_{ij//\kappa})^2$. In other words, the goal is to find values
for $X$, $Y$ and $Z$ such that the MFPT of the new model are as close as possible
to those estimated from $s$. After the splitting of state $q$, the blockMFPT function
recomputes the MFPT between the blocks of EP. The algorithm is iterated until
the squared difference of the MFPT between TP//$\kappa$ and EP falls below the
precision threshold $\epsilon$.

Algorithm MarkovianStateSplit
Input:  A string s resulting from a target POMM TP
        A precision parameter ε
Output: A POMM EP

EP ← estimateML(s);      // Build the ML model (see Def. 17)
M ← sampleMFPT(s);       // MFPT between the blocks of TP//κ (Def. 18)
M//κ ← blockMFPT(EP);    // MFPT between the blocks of EP (Def. 15)

// Iterate until the MFPT of the current model are close enough to those
// estimated from s
while Σ_{i,j} (M_ij − M_ij//κ)^2 > ε do
    q ← selectStateToSplit(EP, M, M//κ);
    EP ← splitState(EP, q, M, M//κ);   // Update the current model
    M//κ ← blockMFPT(EP);              // Recompute MFPT between the blocks of EP
return EP

Fig. 6. Splitting of state q.

Stochastic constraints have to be satisfied in order to keep a proper POMM.
Moreover we require the stationary distribution to be preserved for any state
$q' \neq q$ and $\pi_{q_1} = \pi_{q_2} = \frac{\pi_q}{2}$. All these constraints can easily be formulated on the
problem variables:

$$
\begin{aligned}
&\forall j = 1, \ldots, k : \; x_{j1} \geq 0, \; x_{j2} \geq 0, \; x_{j1} + x_{j2} = A_{i_j q}; \\
&\forall j = 1, \ldots, l : \; y_{1j} \geq 0, \; y_{2j} \geq 0, \; y_{1j} + y_{2j} = 2 A_{q o_j}; \\
&z_{11}, z_{12}, z_{21}, z_{22} \geq 0, \; z_{11} + z_{12} + z_{21} + z_{22} = 2 A_{qq}; \\
&z_{11} + z_{12} + \sum_{j=1}^{l} y_{1j} = 1, \quad z_{21} + z_{22} + \sum_{j=1}^{l} y_{2j} = 1.
\end{aligned}
$$
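
The constraints listed above are easy to check for a candidate split. The following Python sketch only tests feasibility of given matrices and does not solve the optimization problem; the function name `satisfies_split_constraints`, its argument layout and the numerical values are hypothetical.

```python
def satisfies_split_constraints(X, Y, Z, a_in, a_out, a_qq, tol=1e-9):
    """X is k x 2, Y is 2 x l, Z is 2 x 2.
    a_in[j]  = A(i_j, q), a_out[j] = A(q, o_j), a_qq = A(q, q)."""
    nonneg = all(v >= -tol for mat in (X, Y, Z) for row in mat for v in row)
    inputs = all(abs(X[j][0] + X[j][1] - a_in[j]) <= tol
                 for j in range(len(a_in)))
    outputs = all(abs(Y[0][j] + Y[1][j] - 2 * a_out[j]) <= tol
                  for j in range(len(a_out)))
    loops = abs(sum(Z[0]) + sum(Z[1]) - 2 * a_qq) <= tol
    rows = all(abs(sum(Z[i]) + sum(Y[i]) - 1.0) <= tol for i in range(2))
    return nonneg and inputs and outputs and loops and rows

# Hypothetical state q with one input state, one output state and a self-loop.
a_in, a_out, a_qq = [0.4], [0.7], 0.3

# A symmetric split, in which q1 and q2 behave exactly like q.
X = [[0.2, 0.2]]
Y = [[0.7], [0.7]]
Z = [[0.15, 0.15], [0.15, 0.15]]
ok = satisfies_split_constraints(X, Y, Z, a_in, a_out, a_qq)

# Moving output mass from q2 to q1 without adjusting Z breaks the row sums.
bad = satisfies_split_constraints(X, [[0.9], [0.5]], Z, a_in, a_out, a_qq)
```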
6 Conclusion and Future Work

We propose in this paper a novel approach to the induction of HMMs. Firstly, 
the notion of partially observable Markov models (POMMs) is introduced. They 
form a particular case of HMMs where any state emits a single letter with proba- 
bility one, but several states can emit the same letter. It is shown that any HMM 
can be represented by an equivalent POMM. Our induction algorithm aims at 
finding a POMM fitting a sample drawn from an unknown target POMM. The 
induced model is built to fit the dynamics of the target machine observed in the 
sample, not necessarily to match its structure. To do so, a POMM is seen as a 
lumped process of a Markov chain and the induced POMM is built to fit the 
stationary distribution and the mean first passage times (MFPT) observed in 
the sample. 

Our ongoing work includes several issues. The selectStateToSplit func- 
tion defines the order in which states are selected for splitting in our induction 
algorithm. Among all candidate states for splitting, the one providing the largest 
decrease of the objective function after the split could be considered in the first 
place. A simple implementation would compute the values of the objective func- 
tion for all candidate states and would then select the best candidate. More 
efficient ways for computing this optimal state are under study. 

A general solver could be used to solve the optimization problem at each 
iteration of the MarkovianStateSplit algorithm. An efficient implementation 
of this optimization procedure is under development. 

A systematic experimental study of the proposed approach is our very next 
task. We will focus in particular on practical comparisons with standard prob- 
abilistic automata induction algorithms and EM estimation of HMMs using 
greedy approaches to refine predefined structures. Other perspectives include 
a formal study of the convergence of this approach as a function of the precision 
parameter $\epsilon$ and extensions to models for which the underlying Markov chain is
no longer assumed to be regular. 
Abstract. A basic problem in Web information extraction is to find
appropriate queries for informative nodes in trees. We propose to learn
queries for nodes in trees automatically from examples. We introduce
node selecting tree transducers (NSTTs) and show how to induce
deterministic NSTTs in polynomial time from completely annotated examples. We
have implemented learning algorithms for NSTTs, started applying them
to Web information extraction, and present first experimental results.

Keywords: Web information extraction, tree automata and logics, 
grammatical inference. 



1 Introduction 

Web documents in HTML or XML form trees with nodes containing text. The tree 
structure is relevant for Web information extraction (IE) from well structured 
documents as created by databases. Many recent approaches to Web IE therefore 
focus on tree structure [5, 10, 14] rather than pure text [15,22]. 

A base problem in Web information extraction (IE) is to find appropriate 
queries for informative nodes in trees. In Fig. 1, for instance, one might want to 
extract all email addresses (single-slot IE). This can be done by querying for all
links in the last line of each table, i.e., for all A-nodes whose TR-node ancestor 
is the last child of some TABLE-node. Alternatively, one might want to ask for 
all pairs of names and email addresses (multi-slots IE). This problem can be 
reduced to iterated single-slot IE: first search for all tables that encompass the 
pairs and then extract the components from the tables. 

Gottlob et al. [10] advocate monadic Datalog as a representation language
for node queries in trees. This logic programming language is highly expressive
(all regular node queries in trees can be expressed) while enjoying efficient algo- 
rithms for answering queries. The Lixto system for multi-slot IE [1] supplies a 
graphical user interface by which to interactively specify and test node queries 
in monadic Datalog. 

Tree automata yield an alternative representation formalism [18, 21, 8, 3] that
is particularly relevant for grammatical inference. Run-based node queries by
tree automata can be translated in linear time into monadic Datalog, and in
non-elementary time back due to Thatcher and Wright’s famous theorem [23].
A more recent problem here is to deal with the unrankedness of trees [3] as in
HTML and XML, by encoding into binary trees.

Fig. 1. A simple Web page, its corresponding tree.

* This research was partially supported by: “CPER 2000-2006, Contrat de Plan État -
Région Nord/Pas-de-Calais: axe TACT, projet TIC”; fonds européens FEDER “TIC
- Fouille Intelligente de données - Traitement Intelligent des Connaissances” OBJ
2-phasing out - 2001/3 - 4.1 - n 3. And by “ACI masse de données ACIMDD”

G. Paliouras and Y. Sakakibara (Eds.): ICGI 2004, LNAI 3264, pp. 91-102, 2004.
(c) Springer-Verlag Berlin Heidelberg 2004

The objective of the present paper is to learn tree automata that repre- 
sent node queries in trees from completely annotated examples. Query induction 
might be useful in order to circumvent manual query specification as in Lixto. 
Completely annotated examples consist of trees where all nodes are annotated by 
Booleans, stating whether a node is selected or not. We introduce node selecting 
tree transducers (NSTT) for representing node queries in trees, and show how to 
infer deterministic NSTTs from examples by variants of the RPNI algorithm [19]. 
We have implemented several versions of our learning algorithm and started to 
apply them to Web IE, so that we can present first experimental results. 

Related Work. Kosala et al. [14] learn tree automata representing node
queries in trees from less informative examples, which specify selected nodes 
but not unselected ones. This has the advantage that complete annotations for 
all nodes are not needed, but restricts the class of learnable queries to those rep- 
resentable by local or k-testable tree automata. Deterministic NSTTs, in contrast, 
can represent all regular node queries in trees (see subsequent work [2]). 

Node queries of bounded length in monadic second-order (MSO) logic over
trees are shown PAC learnable in [11]. Variants of RPNI for inducing sub-sequential
text transducers were proposed in [12]; these transducers may alter the structure 
of words in contrast to NSTTs which only relabel nodes in trees. Chidlovskii [4] 
proposes induction of word transducers for Web IE. 

2 Node Queries in Binary Trees 

Before considering the particularities of HTML and XML trees, we will deal with
node queries for binary trees. We start from a finite alphabet $\Sigma$ consisting of
binary function symbols $f$ and constants $a$. A binary tree $t$ over $\Sigma$ is a term that
satisfies the grammar:

$$t ::= f(t_1, t_2) \mid a$$
For every tree $t$ let $\mathrm{nodes}(t) \subseteq \{1, 2\}^*$ be the set of nodes of tree $t$. This set
determines the shape of $t$, i.e., two trees have the same shape if and only if they
have the same node sets. The empty word $\epsilon$ always addresses the root of a tree;
the first child of node $v$ is node $v1$ and its second child is $v2$. We write $t(v)$ for
the label of nodes $v \in \mathrm{nodes}(t)$. The size $|t|$ is the cardinality of $\mathrm{nodes}(t)$. The
tree $t = f_1(a_1, f_2(a_1, a_2))$, for instance, has the node set $\{\epsilon, 1, 2, 21, 22\}$. Its root
is labeled by $t(\epsilon) = f_1$ while its third leaf is labeled by $t(22) = a_2$.

Definition 1. A (monadic) node query in binary trees is a function $q$ that
associates to each binary tree $t$ a set of nodes $q(t) \subseteq \mathrm{nodes}(t)$.

The query leaf associates the set of leaves to a given tree, i.e., those nodes
that do not have children. If $t = f(a_1, f(a_1, a_2))$ then $\mathrm{leaf}(t) = \{1, 21, 22\}$.
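
Node addresses and the leaf query can be made concrete with a small Python sketch. The encoding is an assumption made for the illustration: a constant is a string, an inner node is a tuple `(symbol, left, right)`, and the empty string plays the role of the root address $\epsilon$.

```python
def nodes(t, prefix=""):
    """Set of node addresses of t, as words over {1, 2}."""
    if isinstance(t, str):                 # a constant: a single node
        return {prefix}
    _, left, right = t
    return {prefix} | nodes(left, prefix + "1") | nodes(right, prefix + "2")

def leaf(t, prefix=""):
    """The query leaf: addresses of the nodes without children."""
    if isinstance(t, str):
        return {prefix}
    _, left, right = t
    return leaf(left, prefix + "1") | leaf(right, prefix + "2")

t = ("f1", "a1", ("f2", "a1", "a2"))
all_nodes = nodes(t)     # {"", "1", "2", "21", "22"}
leaves = leaf(t)         # {"1", "21", "22"}
```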



3 Tree Automata 

We are interested in regular node queries in trees that can be defined equivalently
by tree automata, in monadic Datalog, or MSO over trees. (See [3] for the case
of unranked trees.) Here, we recall a representation formalism for node queries
based on successful runs of tree automata [21, 8, 3].

A tree automaton $A$ over signature $\Sigma$ consists of a finite set $\mathrm{states}(A)$ of
states, a set $\mathrm{final}(A) \subseteq \mathrm{states}(A)$ of final states, and a finite set $\mathrm{rules}(A)$ of rules
of the form $f(p_1, p_2) \to p$ or $a \to p$ where $p, p_1, p_2 \in \mathrm{states}(A)$, $f$ is a binary
function symbol and $a$ a constant in $\Sigma$. The size $|A|$ of a tree automaton $A$ is
the number of its states plus the number of symbols occurring in its rules.

A run $r$ of a tree automaton $A$ on a tree $t$ is a binary tree over $\mathrm{states}(A)$
of the same shape, i.e. $\mathrm{nodes}(r) = \mathrm{nodes}(t)$, such that for all $v \in \mathrm{nodes}(t)$ and
$f, a \in \Sigma$: if $t(v) = f$ then $f(r(v1), r(v2)) \to r(v) \in \mathrm{rules}(A)$, and if $t(v) = a$ then
$a \to r(v) \in \mathrm{rules}(A)$. A run $r$ of $A$ on $t$ is successful if $r(\epsilon) \in \mathrm{final}(A)$.

Let $A_0$ be an automaton with 3 states 0, 1, 2, a single final state 2, and the
rules: $a \to 0$, $a \to 1$, $f(0, 1) \to 2$. The tree $f(a, a)$ has a unique successful run
by $A_0$, the tree $2(0, 1)$; no other tree permits a successful run with $A_0$.

Definition 2. A tree automaton is (bottom-up) deterministic if no two rules 
have the same left-hand side and unambiguous if no tree permits two successful 
runs. 

Every deterministic tree automaton is clearly unambiguous, but not conversely.
The automaton $A_0$ above yields a counter example. It is nondeterministic given
that tree $a$ permits two distinct runs with $A_0$, but nevertheless unambiguous
given that none of these two runs is successful.

We write $\mathrm{runs}_A(t)$ for the set of all runs of automaton $A$ on tree $t$ and
$\mathrm{succ\_runs}_A(t)$ for the subset of all successful runs. A tree automaton recognizes
all trees $t$ that permit a successful run of $A$ on $t$. The language $L(A)$ of an
automaton $A$ contains all trees that $A$ recognizes.
A pair of a tree automaton $A$ over $\Sigma$ and a set of selection states $P \subseteq
\mathrm{states}(A)$ defines a node query in trees $t \in \mathrm{tree}_\Sigma$:

$$\mathrm{query}_{A,P}(t) = \{v \mid r \in \mathrm{succ\_runs}_A(t), r(v) \in P\}$$

We call a monadic query $q$ regular if it is equal to some query $\mathrm{query}_{A,P}$. Thatcher and
Wright’s famous theorem [23] proves that a query is regular if and only if it can
be defined by a MSO-formula over binary trees with a single free node variable.
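
Runs and run-based queries can be enumerated directly for small automata. The Python sketch below is a naive illustration whose encoding is an assumption: a constant is a string, an inner node is a tuple `(symbol, left, right)`, a run has the same shape with states in place of symbols, and rules are pairs `(left-hand side, state)`. The automaton is $A_0$ from the example above; the selection set {1} is chosen arbitrarily.

```python
def root_state(run):
    """State assigned to the root of a run."""
    return run[0] if isinstance(run, tuple) else run

def runs(t, rules):
    """All runs of the automaton on t (exponential in general)."""
    if isinstance(t, str):                          # rules of the form a -> p
        return [p for lhs, p in rules if lhs == t]
    f, left, right = t
    found = []
    for r1 in runs(left, rules):
        for r2 in runs(right, rules):
            wanted = (f, root_state(r1), root_state(r2))   # f(p1, p2) -> p
            found += [(p, r1, r2) for lhs, p in rules if lhs == wanted]
    return found

def query(t, rules, final, selection):
    """Nodes labeled by a selection state in some successful run."""
    selected = set()

    def collect(run, prefix):
        if root_state(run) in selection:
            selected.add(prefix)
        if isinstance(run, tuple):
            collect(run[1], prefix + "1")
            collect(run[2], prefix + "2")

    for run in runs(t, rules):
        if root_state(run) in final:
            collect(run, "")
    return selected

# The automaton A_0: rules a -> 0, a -> 1, f(0, 1) -> 2, final state 2.
rules = [("a", 0), ("a", 1), (("f", 0, 1), 2)]
runs_on_a = runs("a", rules)                         # two runs, none successful
runs_on_faa = runs(("f", "a", "a"), rules)           # the single run 2(0, 1)
picked = query(("f", "a", "a"), rules, {2}, {1})     # the second child
```

The two runs on the constant $a$ show that $A_0$ is nondeterministic, while the single run on $f(a, a)$ illustrates why it is nevertheless unambiguous.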



4 Node Selection Tree Transducer 

We will introduce NSTTs, a new representation formalism for regular node queries 
in trees suitable for grammatical inference. 

Every subset $q(t) \subseteq \mathrm{nodes}(t)$ can be identified with a tree over Bool with the
same shape, such that the Boolean values $q(t)(v)$ satisfy for all $v \in \mathrm{nodes}(t)$:

$$q(t)(v) \leftrightarrow v \in q(t)$$

The query application $\mathrm{leaf}(f(a, g(a, b)))$, for instance, yields the Boolean tree

false(true, false(true, true)).
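
The identification of a node set with a Boolean tree, and the pairing of a tree with such an annotation, can be sketched as follows. The tuple encoding and the names `as_boolean_tree` and `product_tree` are hypothetical; a constant is a string, an inner node is `(symbol, left, right)`, and a Boolean tree uses Python booleans in place of symbols.

```python
def leaf(t, prefix=""):
    """Addresses of the leaves of t."""
    if isinstance(t, str):
        return {prefix}
    _, left, right = t
    return leaf(left, prefix + "1") | leaf(right, prefix + "2")

def as_boolean_tree(t, selected, prefix=""):
    """Boolean tree of the same shape as t: True exactly on selected nodes."""
    flag = prefix in selected
    if isinstance(t, str):
        return flag
    _, left, right = t
    return (flag,
            as_boolean_tree(left, selected, prefix + "1"),
            as_boolean_tree(right, selected, prefix + "2"))

def product_tree(t, beta):
    """Pair each label of t with the Boolean at the same node of beta."""
    if isinstance(t, str):
        return (t, beta)
    f, left, right = t
    b, beta_left, beta_right = beta
    return ((f, b), product_tree(left, beta_left), product_tree(right, beta_right))

t = ("f", "a", ("g", "a", "b"))
beta = as_boolean_tree(t, leaf(t))      # false(true, false(true, true))
annotated = product_tree(t, beta)
```

The value `annotated` is a completely annotated example in the sense used in this paper: every node carries its label together with a Boolean stating whether it is selected.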

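Given two trees $t$ over $\Sigma$ and $\beta$ over Bool with the same shape, we define a tree
$t \times \beta$ over $\Sigma \times \mathrm{Bool}$ by requiring for all $v \in \mathrm{nodes}(t) = \mathrm{nodes}(\beta) = \mathrm{nodes}(t \times \beta)$:

$$(t \times \beta)(v) = (t(v), \beta(v))$$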

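A tree language $L$ over $\Sigma \times \mathrm{Bool}$ is functional (resp. total) if for all $t \in \mathrm{tree}_\Sigma$
there exists at most (resp. at least) one tree $\beta \in \mathrm{tree}_{\mathrm{Bool}}$ such that $t \times \beta \in L$.
We associate tree languages $\mathrm{lan}_q$ over $\Sigma \times \mathrm{Bool}$ to queries $q$ in trees over $\Sigma$:

$$\mathrm{lan}_q = \{t \times q(t) \mid t \in \mathrm{tree}_\Sigma\}$$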

Tree languages $\mathrm{lan}_q$ are always functional and total. Conversely, we will associate
queries to functional tree languages. Let $\mathrm{total}(L)$ be the functional total language
containing $L$ and contained in $L \cup \mathrm{tree}_{\Sigma \times \{\mathrm{false}\}}$. Functional languages $L$ define
node queries in trees that satisfy for all trees $t \in \mathrm{tree}_\Sigma$ and $\beta \in \mathrm{tree}_{\mathrm{Bool}}$:

$$\mathrm{query}_L(t) = \beta \text{ iff } t \times \beta \in \mathrm{total}(L)$$

We have query,.^ = q for all node queries q in trees and lan queryj . = total(L) 
for all functional tree languages L. For every functional tree language L there 
exists exactly one query q with lan g = L, but for some queries q there exist 
many functional tree languages L with total(L) = lan g . This ambiguity has to 
be treated carefully. 

Definition 3. An NSTT is a tree automaton $A$ whose language $L(A)$ is functional.
An NSTT-query has the form $\mathrm{query}_{L(A)}$ where $A$ is an NSTT.

Proposition 1. NSTT-queries are regular.

Proof. Given an NSTT $A$ we define a projection tree automaton $\pi(A)$ over $\Sigma$. We
set $\mathrm{states}(\pi(A)) = \mathrm{states}(A) \times \mathrm{Bool}$, $\mathrm{final}(\pi(A)) = \mathrm{final}(A) \times \mathrm{Bool}$, and fix the
automata rules as follows:

$$\frac{(f,b)(p_1,p_2) \to p \in \mathrm{rules}(A)}{f((p_1,b_1),(p_2,b_2)) \to (p,b) \in \mathrm{rules}(\pi(A))} \qquad \frac{(a,b) \to p \in \mathrm{rules}(A)}{a \to (p,b) \in \mathrm{rules}(\pi(A))}$$

The subsequent Lemma 1 yields $\mathrm{query}_{L(A)} = \mathrm{query}_{\pi(A),\, \mathrm{states}(A) \times \{\mathrm{true}\}}$.

Lemma 1. $r \in \mathrm{runs}_A(t \times \beta)$ iff $r \times \beta \in \mathrm{runs}_{\pi(A)}(t)$.
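The projection construction used in the proof can be written down directly. The following Python sketch is only an illustration under assumptions: the dictionary layout of automata and the name `project` are invented here, and the sample automaton is the one listed with Fig. 4 of this paper.

```python
BOOL = (False, True)

def project(aut):
    """Build pi(A) over Sigma from an automaton A over Sigma x Bool.

    Assumed layout of `aut`:
      "leaf":  {(a, b): set of states}            rules (a, b) -> p
      "inner": {((f, b), p1, p2): set of states}  rules (f, b)(p1, p2) -> p
      "final": set of final states
    States of pi(A) are pairs (state, bool).
    """
    leaf, inner = {}, {}
    for (a, b), targets in aut["leaf"].items():
        leaf.setdefault(a, set()).update((p, b) for p in targets)
    for ((f, b), p1, p2), targets in aut["inner"].items():
        # The Booleans b1, b2 attached to the child states are unconstrained.
        for b1 in BOOL:
            for b2 in BOOL:
                key = (f, (p1, b1), (p2, b2))
                inner.setdefault(key, set()).update((p, b) for p in targets)
    final = {(p, b) for p in aut["final"] for b in BOOL}
    return {"leaf": leaf, "inner": inner, "final": final}

nstt = {
    "leaf": {("a", True): {2}, ("a", False): {1}},
    "inner": {(("@", False), 2, 2): {1}, (("@", False), 1, 1): {2}},
    "final": {1},
}
pi = project(nstt)
```

Note that `pi` is nondeterministic even though `nstt` is deterministic: the symbol a now leads to both `(2, True)` and `(1, False)`. Selecting the states whose second component is `True` corresponds to the selection set states(A) x {true} in the proof.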

The converse of Prop. 1 holds, as proved by the authors in a follow-up paper [2].
The construction in the proof shows that NSTT-queries can be answered efficiently, in
that $\mathrm{query}_{L(A)}(t)$ can be computed in time $O(|A| \cdot |t|)$ from NSTTs $A$ and trees $t$.
It is sufficient to transform NSTT-queries into automata queries (in linear time),
which can be answered in linear time. We finally present a new polynomial time
algorithm for testing whether a deterministic automaton $A$ is an NSTT.

Proposition 2. The language of a deterministic tree automaton $A$ over $\Sigma \times \mathrm{Bool}$
is functional if and only if the projection $\pi(A)$ to $\Sigma$ is unambiguous.

Proof. For the one direction, let $A$ be deterministic (and thus unambiguous)
and $L(A)$ functional. Suppose $r_1 \times \beta_1, r_2 \times \beta_2 \in \mathrm{succ\_runs}_{\pi(A)}(t)$ for some
$t \in \mathrm{tree}_\Sigma$. Lemma 1 yields $r_1 \in \mathrm{succ\_runs}_A(t \times \beta_1)$ and $r_2 \in \mathrm{succ\_runs}_A(t \times \beta_2)$.
The functionality of $L(A)$ implies that $t \times \beta_1 = t \times \beta_2$ and thus $\beta_1 = \beta_2$. The
unambiguity of $A$ yields $r_1 = r_2$ so that $r_1 \times \beta_1 = r_2 \times \beta_2$. This proves that $\pi(A)$
is unambiguous.

For the converse, assume that $\pi(A)$ is unambiguous and suppose
$t \times \beta_1, t \times \beta_2 \in L(A)$. Let $r_1 \in \mathrm{succ\_runs}_A(t \times \beta_1)$ and $r_2 \in \mathrm{succ\_runs}_A(t \times \beta_2)$. Lemma
1 yields $r_1 \times \beta_1, r_2 \times \beta_2 \in \mathrm{succ\_runs}_{\pi(A)}(t)$. The unambiguity of $\pi(A)$ implies
$\beta_1 = \beta_2$, i.e., $L(A)$ is functional.

Proposition 3. Unambiguity of tree automata can be tested in polynomial time. 

Proof. An algorithm for word automata can be found in [6]. Here, we test
unambiguity for tree automata $A$. The algorithm is based on the binary relation
$\mathrm{drst}_A \subseteq \mathrm{states}(A) \times \mathrm{states}(A)$, where $\mathrm{drst}_A(p,p')$ means that two distinct runs of
$A$ label the root of the same tree with $p$ and $p'$ respectively.

$$\mathrm{drst}_A(p,p') \text{ iff } \exists t \in \mathrm{tree}_\Sigma\ \exists r, r' \in \mathrm{runs}_A(t).\ r \neq r' \wedge r(\varepsilon) = p \wedge r'(\varepsilon) = p'$$

The automaton $A$ is ambiguous if and only if there exist $p, p' \in \mathrm{final}(A)$ such that
$\mathrm{drst}_A(p,p')$. It remains to compute the relation $\mathrm{drst}_A$. We assume that $A$ does not
contain useless states, i.e. states that are not used in any run. We compute the relation
by applying the rules of Fig. 2 exhaustively.

$$\begin{aligned}
\mathrm{drst}_A(p,p') \Leftarrow{} & (a \to p \in \mathrm{rules}(A) \wedge a \to p' \in \mathrm{rules}(A) \wedge p \neq p') \\
\vee{} & (f(p_1,p_2) \to p \in \mathrm{rules}(A) \wedge f(p_1,p_2) \to p' \in \mathrm{rules}(A) \wedge p \neq p') \\
\vee{} & (f(p_1,p_2) \to p \in \mathrm{rules}(A) \wedge f(p_1',p_2) \to p' \in \mathrm{rules}(A) \wedge \mathrm{drst}_A(p_1,p_1')) \\
\vee{} & (f(p_1,p_2) \to p \in \mathrm{rules}(A) \wedge f(p_1,p_2') \to p' \in \mathrm{rules}(A) \wedge \mathrm{drst}_A(p_2,p_2')) \\
\vee{} & (f(p_1,p_2) \to p \in \mathrm{rules}(A) \wedge f(p_1',p_2') \to p' \in \mathrm{rules}(A) \wedge \mathrm{drst}_A(p_1,p_1') \wedge \mathrm{drst}_A(p_2,p_2'))
\end{aligned}$$

Fig. 2. Testing non-ambiguity.

The first rule permits to derive runs for constant $a$-trees leading into distinct
states. If we can apply the second rule, there are trees $t_1, t_2$ with runs into
states $p_1, p_2$ respectively, since there are no useless states. The tree $f(t_1, t_2)$ then
permits two distinct runs into $p$ and $p'$. The third rule is recursive. Suppose
there are distinct runs on $t_1$ leading to $p_1$ and $p_1'$ and a run on $t_2$ into $p_2$; then
there exist distinct runs on $f(t_1, t_2)$ leading into $p$ and $p'$ respectively (even if
$p = p'$). The fourth and fifth rules are similar.

The whole algorithm is polynomial in the size of $A$. We can apply at most
$|\mathrm{states}(A)|^2$ rules if we avoid inferring the same pair $\mathrm{drst}_A(p,p')$ twice. Every rule
application can be done in polynomial time in the size of the automaton.
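The saturation procedure of this proof can be sketched as a simple fixpoint loop. The Python code below is an illustration under assumptions, not the authors' implementation: rules are encoded as tuples, the function names are invented, and the final test over pairs of final states is one reading of the ambiguity criterion. As in the proof, the automaton is assumed to contain no useless states.

```python
from itertools import product

def distinct_root_pairs(leaf_rules, inner_rules):
    """Least relation closed under the five rules of Fig. 2.

    leaf_rules:  set of (a, p)          for rules a -> p
    inner_rules: set of (f, p1, p2, p)  for rules f(p1, p2) -> p
    """
    drst = set()
    # Rule 1: the same constant leads into two distinct states.
    for (a, p), (b, q) in product(leaf_rules, repeat=2):
        if a == b and p != q:
            drst.add((p, q))
    changed = True
    while changed:
        changed = False
        for (f, p1, p2, p), (g, q1, q2, q) in product(inner_rules, repeat=2):
            if f != g or (p, q) in drst:
                continue
            same1, same2 = p1 == q1, p2 == q2
            diff1, diff2 = (p1, q1) in drst, (p2, q2) in drst
            if ((same1 and same2 and p != q)   # rule 2
                    or (diff1 and same2)       # rule 3
                    or (same1 and diff2)       # rule 4
                    or (diff1 and diff2)):     # rule 5
                drst.add((p, q))
                changed = True
    return drst

def is_ambiguous(leaf_rules, inner_rules, final):
    """True iff two distinct runs can reach final states at the root of one tree."""
    drst = distinct_root_pairs(leaf_rules, inner_rules)
    return any((p, q) in drst for p in final for q in final)

# Hypothetical automaton: f(a, a) has two distinct successful runs.
ambiguous = is_ambiguous({("a", 1), ("a", 2)},
                         {("f", 1, 1, 3), ("f", 2, 2, 3)},
                         {3})
```

Each pass of the loop inspects all pairs of rules, and at most quadratically many pairs of states can be added, so the sketch stays polynomial in the size of the automaton.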

Corollary 1. Testing whether a deterministic tree automaton over $\Sigma \times \mathrm{Bool}$ is
an NSTT can be done in polynomial time.

Proof. For a deterministic NSTT $A$, we check the non-ambiguity of its projection
$\pi(A)$ (Prop. 2). This can be done in polynomial time (Prop. 3).

5 Inducing NSTT-Queries 

We show how to induce NSTT-queries for nodes in trees from completely annotated
examples, i.e., pair trees of the form $t \times \beta$. Let samples 3 be the set of all
completely annotated examples. Every tree $t \times q(t)$ is a completely annotated
example for $q$. Complete annotations thus specify for all nodes of a tree whether
they are selected or not. Given that completely annotated examples express positive
and negative information, it should not come as a surprise that we will learn
by the RPNI algorithm for regular positive and negative inference [16, 19, 20].

5.1 Identification in the Limit 

We recall the learning model of identification in the limit [9] and apply it to 
identification of node queries in trees. 

Definition 4. Let class and examples be sets related by a binary relation called
consistency, and $\equiv$ an equivalence relation on class. Let samples be the set of all
finite subsets of examples. A sample in samples is consistent with a class member
if all its examples are.

Members of class are identifiable in the limit from examples if there are computable
functions $\mathrm{learner} : \mathrm{samples} \to \mathrm{class}$ mapping samples to class members
and $\mathrm{char} : \mathrm{class} \to \mathrm{samples}$ computing consistent samples for all class members,
called characteristic samples, such that $\mathrm{learner}(S) \equiv M$ for every member
$M \in \mathrm{class}$ and sample $S \supseteq \mathrm{char}(M)$ consistent with $M$.

Identification of NSTT-queries from completely annotated examples is an instance
of Def. 4 where $\mathrm{class} = \{\mathrm{query}_{L(A)} \mid A \text{ is an NSTT}\}$, examples contains all
completely annotated examples, and equivalence $\equiv$ of NSTT-queries is equality.
An annotated example $t \times \beta$ is consistent with a query $q$ if $q(t) = \beta$.

Identifying deterministic NSTTs from completely annotated samples is another
case. Here $\mathrm{class} = \{A \mid A \text{ is an NSTT}\}$, and an NSTT $A$ is consistent with a
completely annotated example if $\mathrm{query}_{L(A)}$ is. The equivalence $A \equiv A'$ is total
language equality $\mathrm{total}(L(A)) = \mathrm{total}(L(A'))$.

Identification of regular tree languages from positive and negative examples
is the third case. Given some set $T = T_\Sigma$ of trees, let $\mathrm{class} \subseteq 2^T$ be a class of
regular tree languages. A positive example is an element of $T \times \{\mathrm{true}\}$, a negative
example an element of $T \times \{\mathrm{false}\}$. Let $\mathrm{samples}(T)$ be the set of all samples with
positive or negative examples. An example $(t, b)$ is consistent with a language
$L \in \mathrm{class}$ if $t \in L \Leftrightarrow b$. Equivalence $\equiv$ of tree languages is language equality.

Identification of deterministic tree automata from positive and negative examples
is similar. Again, two automata are equivalent if they have the same
language; an automaton is consistent with an example if its language is.

Proposition 4. Identification of NSTT-queries from completely annotated exam- 
ples can be reduced to identification of regular tree languages from positive and 
negative examples. 

Proof. Completely annotated examples $t \times \beta$ for a query $q$ correspond to positive
examples $(t \times \beta, \mathrm{true})$ for $\mathrm{lan}_q$ and furthermore imply negative examples
$(t \times \beta', \mathrm{false})$ for all $\beta' \neq \beta$. Let $T = \mathrm{tree}_{\Sigma \times \mathrm{Bool}}$. We define a function pn from
samples 3 to $\mathrm{samples}(T)$ mapping samples with completely annotated examples to
samples of positive and negative examples:

$$\mathrm{pn}(S) = \{ (t \times \beta', b) \mid t \times \beta \in S,\ b \Leftrightarrow (\beta = \beta') \}$$
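To see why pn is costly, it helps to write it out naively. The Python sketch below is an illustration under assumptions (annotated trees are nested tuples with a pair (symbol, Boolean) at each node, and the helper names are invented): for every example it enumerates all Boolean re-annotations, so an example with n nodes contributes 2^n elements.

```python
from itertools import product

def size(tree):
    """Number of nodes of a nested-tuple tree (label, child, ...)."""
    return 1 + sum(size(c) for c in tree[1:])

def reannotate(tree, bits):
    """Copy of `tree` whose Boolean components are replaced, in preorder, by `bits`."""
    it = iter(bits)

    def go(node):
        sym, _ = node[0]
        b = next(it)  # consume the bit for this node before visiting children
        return ((sym, b),) + tuple(go(c) for c in node[1:])

    return go(tree)

def pn(sample):
    """Positive and negative examples implied by completely annotated examples."""
    out = set()
    for example in sample:
        for bits in product((False, True), repeat=size(example)):
            candidate = reannotate(example, bits)
            # Only the original annotation is positive; every other one is negative.
            out.add((candidate, candidate == example))
    return out

example = (("@", False), (("a", True),), (("a", False),))
result = pn({example})
```

For the three-node `example`, `result` has eight elements, exactly one of which is positive.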

Let class be the class of regular tree languages over $T$. We assume that languages in
class can be identified from positive and negative examples by functions
$\mathrm{learner} : \mathrm{samples}(T) \to \mathrm{class}$ and $\mathrm{char} : \mathrm{class} \to \mathrm{samples}(T)$.

Let $\mathrm{class}'$ be the set of NSTT-queries. We show that queries in $\mathrm{class}'$ can be
identified from completely annotated examples by the following functions:

$\mathrm{learner}'$ from samples 3 to $\mathrm{class}'$, with $\mathrm{learner}'(S) = \mathrm{query}_{\mathrm{learner}(\mathrm{pn}(S))}$

$\mathrm{char}'$ from $\mathrm{class}'$ to samples 3, with $\mathrm{char}'(q) = \{ t \times \beta \in \mathrm{lan}_q \mid (t \times \beta', b) \in \mathrm{char}(\mathrm{lan}_q) \}$

First note that $\mathrm{char}'(q)$ is always consistent with $q$. If $t \times \beta \in \mathrm{char}'(q)$ then by
definition $t \times \beta \in \mathrm{lan}_q$ and thus $q(t) = \beta$.

Second, notice that $\mathrm{pn}(\mathrm{char}'(q)) \supseteq \mathrm{char}(\mathrm{lan}_q)$. For positive examples
$(t \times \beta', \mathrm{true}) \in \mathrm{char}(\mathrm{lan}_q)$, clearly $t \times \beta' \in \mathrm{char}'(q)$. For negative examples
$(t \times \beta', \mathrm{false}) \in \mathrm{char}(\mathrm{lan}_q)$ there is some $t \times \beta \in \mathrm{lan}_q$ since $\mathrm{lan}_q$ is total, with
$\beta \neq \beta'$ since $\mathrm{lan}_q$ is functional and $\mathrm{char}(\mathrm{lan}_q)$ consistent with $\mathrm{lan}_q$. Hence
$t \times \beta \in \mathrm{char}'(q)$ so that $(t \times \beta', \mathrm{false}) \in \mathrm{pn}(\mathrm{char}'(q))$.

Let $q \in \mathrm{class}'$ and $S \supseteq \mathrm{char}'(q)$ consistent with $q$. Then $\mathrm{pn}(S) \supseteq \mathrm{char}(\mathrm{lan}_q)$ and
$\mathrm{pn}(S)$ is consistent with $\mathrm{lan}_q$. Identifying regular languages yields
$\mathrm{learner}(\mathrm{pn}(S)) = \mathrm{lan}_q$, i.e. $\mathrm{learner}'(S) = \mathrm{query}_{\mathrm{lan}_q} = q$ as required for identifying NSTT-queries.

Theorem 1. NSTT-queries for nodes in trees can be identified in the limit from
completely annotated examples.

Proof. The problem is equivalent to identifying regular tree languages from positive
and negative examples (Prop. 4), which can be done by RPNI [16, 19, 20].

5.2 Identification in Polynomial Time and Data 

We now show how to identify deterministic NSTTs in polynomial time from completely
annotated examples.

Identification in polynomial time and data is identification in the limit with
functions learner in polynomial time and char in polynomial space [7].

The classical RPNI algorithm identifies deterministic automata from positive
and negative examples in polynomial time and data [19, 16]. It applies to tree
languages similarly as to word languages [20]. For a given signature, it computes
a polynomial time function $\mathrm{learner}_{\mathrm{RPNI}}$ mapping samples of positive and negative
examples to deterministic tree automata, and a polynomial space function $\mathrm{char}_{\mathrm{RPNI}}$
of the inverse type.

Theorem 2. Let a completely annotated example $t \times \beta$ be consistent with an
NSTT $A$ if $\mathrm{query}_{L(A)}(t) = \beta$, and two NSTTs be equivalent if they define the same
queries. Given these consistency and equivalence relations, we can identify deterministic
NSTTs in polynomial time and data from completely annotated examples.

Proof. For identifying deterministic NSTTs from completely annotated examples,
we want to compute the function $\mathrm{learner}_{\mathrm{RPNI}} \circ \mathrm{pn}$ which inputs completely annotated
examples over $\Sigma \times \mathrm{Bool}$, transforms them into positive and negative examples
and then outputs the result of the RPNI learner for the signature $\Sigma \times \mathrm{Bool}$.
This output is an NSTT whose associated language is total, which is the representative
of its equivalence class.

Unfortunately, the function pn is not in polynomial space, given that the size
of its output may be exponential. In order to solve this problem, we propose
a more efficient implementation of the function $\mathrm{learner}_{\mathrm{RPNI}} \circ \mathrm{pn}$, the algorithm
tRPNI, a variant of RPNI.

Let us recall the RPNI algorithm [20, 19, 16]. RPNI inputs a sample of positive
and negative examples. It first computes a deterministic automaton which recognizes
the set of positive examples in the sample. It then merges states exhaustively
in some fixed order. A merging operation applies to the current deterministic
automaton $A$ and two states $q_1, q_2 \in \mathrm{states}(A)$ and returns a deterministic
automaton $\mathrm{det\_merge}(A, q_1, q_2)$. A deterministic merge is a merge followed by the
recursive merges needed to preserve determinism. For example, merging
$q_1$ and $q_2$ in an automaton with rules $f(q_1) \to q_3$ and $f(q_2) \to q_4$ requires
merging $q_3$ with $q_4$. A merging operation is licensed only if $\mathrm{det\_merge}(A, q_1, q_2)$ is
consistent with all negative examples in the sample. The main loop of RPNI thus
performs at most quadratically many functionality tests and merging operations.
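A deterministic merge can be sketched as follows. This Python code is a simplified illustration, not the authors' implementation: the rule encoding, the function name and the choice of the smaller state as representative are assumptions, and the licensing test (consistency or functionality) is left out. The sample automaton is the initial automaton listed with Fig. 3.

```python
def det_merge(rules, final, q1, q2):
    """Merge q1 and q2, then keep merging until the rules are deterministic again.

    rules: dict mapping (symbol, tuple_of_child_states) -> state
    final: set of final states
    States must be comparable; the smaller one represents a merged class.
    """
    parent = {}

    def find(q):
        while parent.get(q, q) != q:
            q = parent[q]
        return q

    pending = [(q1, q2)]
    while pending:
        a, b = pending.pop()
        a, b = find(a), find(b)
        if a == b:
            continue
        parent[max(a, b)] = min(a, b)
        # Re-read all rules modulo the merged states; a clash forces another merge.
        seen = {}
        for (sym, children), target in rules.items():
            key = (sym, tuple(find(c) for c in children))
            tgt = find(target)
            if key in seen and seen[key] != tgt:
                pending.append((seen[key], tgt))
            else:
                seen[key] = tgt
    new_rules = {
        (sym, tuple(find(c) for c in children)): find(target)
        for (sym, children), target in rules.items()
    }
    return new_rules, {find(q) for q in final}

initial = {
    (("a", True), ()): 2,
    (("a", False), ()): 1,
    (("@", False), (2, 2)): 3,
    (("@", False), (3, 1)): 4,
    (("@", False), (2, 4)): 5,
}
step1, final1 = det_merge(initial, {5}, 1, 3)
step2, final2 = det_merge(step1, final1, 2, 4)
```

Merging 2 and 4 in `step1` makes the rules for `(2, 2)` clash, which forces the merge of states 5 and 1. The result `step2` has the four rules listed with Fig. 4 and the single final state 1.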

The algorithm tRPNI behaves as RPNI except that it checks differently whether
deterministic merging operations are licensed. It tests whether the language of
$\mathrm{det\_merge}(A, q_1, q_2)$ is functional. It thereby avoids enumerating the implicit negative
examples for functional languages once and for all. And fortunately, we can
check for functionality in polynomial time in the automaton size (Corollary 1).

(a, true) → 2
(a, false) → 1
(@, false)(2, 2) → 3
(@, false)(3, 1) → 4
(@, false)(2, 4) → 5
Final states: {5}

Run (states as superscripts; 1 = true, 0 = false):
(@,0)^5( (a,1)^2, (@,0)^4( (@,0)^3( (a,1)^2, (a,1)^2 ), (a,0)^1 ) )

Fig. 3. The initial automaton and its run on the example tree.

(a, true) → 2
(a, false) → 1
(@, false)(2, 2) → 1
(@, false)(1, 1) → 2
Final states: {1}

Run (states as superscripts; 1 = true, 0 = false):
(@,0)^1( (a,1)^2, (@,0)^2( (@,0)^1( (a,1)^2, (a,1)^2 ), (a,0)^1 ) )

Fig. 4. The inferred NSTT and its run on the example tree.

Conversely, we compute characteristic samples for NSTTs $A$ as we did before
for $\mathrm{query}_{L(A)}$ in the proof of Prop. 4:
$\mathrm{char}(A) = \{ t \times \beta \in \mathrm{total}(L(A)) \mid (t \times \beta', b) \in \mathrm{char}_{\mathrm{RPNI}}(A) \}$.
Since $\mathrm{lan}_{\mathrm{query}_{L(A)}} = \mathrm{total}(L(A))$ it follows that all examples
in $\mathrm{char}(A)$ are consistent with $A$ or with any other NSTT in its equivalence class.

5.3 Example 

We illustrate how the learning algorithm works in a simplified case. We want to
extract leaves that are on an odd level in trees over the alphabet $\Sigma = \{@, a\}$.

The first step of the algorithm is the construction of the initial automaton.
This automaton and its run on the input sample are shown in Fig. 3. Merges
are then performed following the order of states. det_merge(A2, 1, 2) is
rejected for lack of functionality. The following merge, det_merge(A3, 1, 3), is then
accepted. Then det_merge(A4, 1, 4) is rejected. det_merge(A4, 2, 4) is accepted
(which implies the merge of states 5 and 1 by propagation of determinism).
This results in the automaton presented in Fig. 4. The example being well chosen,
it appears that this NSTT performs the wanted annotation.

6 Application to Web Information Extraction 

We have implemented the tRPNI algorithm and started applying it to Web IE. 
We report first results of this work in progress and discuss some of the problems 
that arise. 

6.1 Modeling HTML Trees 

Unranked trees. HTML or XML documents form unranked trees. We encode unranked trees into
binary trees. We use the Currying-inspired encoding of [3] in order to map to
stepwise tree automata, rather than the more frequent first-child next-sibling
encoding as in selection automata [8].

The unranked tree TABLE(TR(TD),TR(TD),TR(TD)), for instance, is translated
into the binary tree TABLE@(TR@TD)@(TR@TD)@(TR@TD) with a single binary
symbol @. Completely annotated examples for unranked trees are translated
into completely annotated examples on ranked encodings, too. Node annotations
in unranked trees become leaf annotations in binary encodings. Inner @-nodes in
binary encodings correspond to edges in unranked trees. As we do not wish to
select edges, we label all @-nodes by false.
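The Currying-inspired encoding can be illustrated with a short Python sketch. The representation of unranked trees as `(label, [children])` pairs and the function names are assumptions made for this example; labels are assumed to differ from the reserved symbol "@".

```python
def curry(tree):
    """Encode an unranked tree (label, [children]) as a binary tree.

    Leaves of the result are labels; inner nodes are ("@", left, right),
    where "@" is read as a left-associative application.
    """
    label, children = tree
    encoded = label
    for child in children:
        encoded = ("@", encoded, curry(child))
    return encoded

def show(binary):
    """Render a binary encoding, parenthesizing only compound right operands."""
    if isinstance(binary, str):
        return binary
    _, left, right = binary
    rendered_right = show(right)
    if not isinstance(right, str):
        rendered_right = "(" + rendered_right + ")"
    return show(left) + "@" + rendered_right

row = ("TR", [("TD", [])])
table = ("TABLE", [row, row, row])
encoded = curry(table)
text = show(encoded)
```

Every node of the unranked tree becomes a leaf of `encoded`, and every edge becomes an inner @-node, which matches the remark that node annotations turn into leaf annotations.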

Infinite alphabets. Leaves of HTML trees may contain arbitrary texts. So far how- 
ever, we assumed finite alphabets. In our experiments, we ignore leaf contents 
completely, by abstracting all texts into a single symbol. 

Node attributes. Node attributes are currently ignored. Each HTML tag is ab- 
stracted into its corresponding symbol, whatever its attributes are. 

6.2 Approaching Problems 

A nice feature of RPNI is that it never makes wrong generalizations when applied
to a characteristic sample. In practice, however, we only have small samples
at our disposal, and these are seldom characteristic.

Wrong generalizations. Lacking negative information is particularly problematic,
as it leads to wrong generalizations. This may have the consequence that
parts of documents cannot be recognized by inferred NSTTs, so that NSTT-queries
do not select any nodes from such documents. We have designed two heuristics
to deal with that problem.

Typed merging. We use typing as inspired by [13] and forbid states with
different types to be merged. So far, we experiment with a fairly basic typing
system: leaves of binary encodings are typed by their corresponding HTML tag, while
inner nodes inherit the type of their first child. Types in annotated examples
become types in the initial automaton. This prevents many wrong generalizations,
while allowing most of the meaningful ones. Our typing reflects the structure
of encodings of unranked trees and the fact that we do not want to merge nodes with
different HTML tags.

Wild-card interpretation. We relax the querying interpretation of inferred NSTTs.
Consider a tree $t_1 @ t_2$. If our deterministic NSTT does not have any run on $t_2$,
but a run for $t_1$ leading into state $q_1$, and if there exists a single rule of the form
$q_1 @ q_2 \to q$, then we permit relaxed runs of $t_1 @ t_2$ into $q$ that label all nodes in
$t_2$ by wild-cards. Nodes labeled by wild-cards are never selected.

6.3 Experiments 

We have experimented with the Okra and Bigbook tasks from the RISE benchmarks
(www.isi.edu/info-agents/RISE). Both Okra and Bigbook are computer-generated
web pages which represent sets of personal information. The extraction
task is to extract e-mail addresses. Results are given in Fig. 5. In Okra, one
example is enough to achieve good results. Without wild-cards, performance
is weaker in Bigbook, even though the task does not seem more complex than
with Okra. Bigbook illustrates the problem of wrong generalizations. At the end
of each document, every letter of the alphabet is indicated, either as a link or
as standard text. With few examples, tRPNI fails to infer an NSTT that recognizes
this part of the document, which is totally irrelevant to the querying task. At the
same time, other relevant parts of the document are properly labelled. Permitting
wild-card interpretations helps in this case.

          # of Examples   Accuracy         Size of initial NSTT   Size of inferred NSTT   Learning time
Okra      1               100 %            72                     24                      1.02 s
          2               100 %            82                     24                      1.24 s
          3               100 %            85                     24                      1.34 s
Bigbook   1               76 % / 100 % *   162                    37                      4.14 s
          2               85 % / 100 % *   172                    42                      7.12 s
          3               100 %            179                    48                      9.52 s

Fig. 5. Experimental results on Okra and Bigbook benchmarks. (* indicates results
without and with the use of wild-cards).



Experiments with more complex tasks so far often yield poor results, because
of the lack of information in pure structure and because of wrong generalizations.

6.4 Possible Improvements 

Better text abstraction. The conversion of HTML node information into symbols
could be improved using text-based information extraction techniques. For instance,
instead of using one generic symbol for leaves, one could classify leaves into
several clusters such as names, dates, numbers, etc.

Better merging orders. A crucial parameter of RPNI (and tRPNI) is the order
in which state merges are performed. The technique of evidence driven state
merging [17] could be applied here. The distance between nodes in the encoded
unranked trees should be taken into account in this order.

7 Conclusion 

We have presented node selecting tree transducers to represent node queries in
trees. We have proposed a variant of RPNI that can identify NSTTs in polynomial
time from annotated examples. We have started applying our learning algorithm
to Web information extraction and could report first encouraging results.

In follow-up work [3], we have shown that all regular queries in trees can
be represented by NSTTs. An open theoretical question is whether deterministic
NSTTs with a total language can be identified in polynomial time.

In future work, we plan to continue improving our learning algorithms in
practice. The open challenge remains to build feasible and reliable learning-based
systems for Web information extraction.

Abstract. The present work studies clustering from an abstract point of view and
investigates its properties in the framework of inductive inference. Any class $S$ considered
is given by a numbering $A_0, A_1, \ldots$ of nonempty subsets of $\mathbb{N}$ or $\mathbb{Q}^k$ which is also
used as a hypothesis space. A clustering task is a finite and nonempty set of indices
of pairwise disjoint sets. The class $S$ is said to be clusterable if there is an algorithm
which, for every clustering task $I$, converges in the limit on any text for $\bigcup_{i \in I} A_i$ to a
finite set $J$ of indices of pairwise disjoint clusters such that $\bigcup_{j \in J} A_j = \bigcup_{i \in I} A_i$. A class
is called semiclusterable if there is such an algorithm which finds a $J$ with the last
condition relaxed to $\bigcup_{j \in J} A_j \supseteq \bigcup_{i \in I} A_i$.

The relationship between natural topological properties and clusterability is investi- 
gated. Topological properties can provide sufficient or necessary conditions for clus- 
terability but they cannot characterize it. On one hand, many interesting conditions 
make use of both the topological structure of the class and a well-chosen numbering. 
On the other hand, the clusterability of a class does not depend on the decision which 
numbering of the class is used as a hypothesis space for the clusterer. 

These ideas are demonstrated in the context of geometrically defined classes. Clustering 
of many of these classes requires besides the text for the clustering task some additional 
information: the class of convex hulls of finitely many points in a rational vector space 
can be clustered with the number of clusters as additional information. Similar studies 
are carried out for polygons with and without holes. 

Furthermore, the power of oracles is investigated. The Turing degrees of maximal
oracles which permit solving all computationally intractable aspects of clustering are
determined. It is shown that some oracles are trivial in the sense that they do not
provide any useful information for clustering at all. Some topologically difficult classes
cannot be clustered with the help of any oracle.

1 Introduction

The purpose of the paper is to study the role of computation and topology in 
the clustering process. To this aim, the following topics are investigated in an 
abstract model of clustering: 

1. necessary or sufficient topological conditions for clustering; 

2. various relationships between clustering, learning and hypothesis spaces; 

3. clusterability of many natural classes of geometrically defined objects; 

4. oracles as a method to distinguish between topological and computational
aspects of clustering.

Clustering has been widely studied in several forms in the fields of machine
learning and statistics [2, 5, 15, 17]. However, abstract treatments of the topic
are rare. Kleinberg [13] provides an axiomatic approach to clustering, but, in
his setting, computability per se is not an issue. In contrast, the present work
investigates clustering from the perspective of Gold style learning theory [9, 11]
where limitations also stem from uncomputable phenomena.

The basic setting is that a class of potential clusters is given. This class is
recursively enumerable. A finite set $I$ of (indices of) pairwise disjoint clusters
from the given class is called a clustering task. Given such a task, the clusterer
(which might be any algorithmic device) receives a text containing all the data
occurring in these clusters and is supposed to find in the limit a set $J$ of (indices
of) pairwise disjoint clusters which cover all the data to be seen. There are two
variants with respect to a third condition: if one requires that the union of the
clusters given by $I$ is the same as the union of the clusters given by $J$, then one
refers to this problem as clustering; if this condition is omitted, then one refers
to this problem as semiclustering.

Clustering is in some cases more desirable than semiclustering: for example
the clustering tasks from the class $S_{\mathrm{conv},k}$ defined in Definition 8.1 are collections
of convex sets having a positive distance from each other. The solution to such
a clustering task is unique since each of these sets corresponds to a cluster. A
clusterer has to identify these sets while a semiclusterer can just converge to the
convex hull of all data to be seen. Such a solution is legitimate for semiclustering
since it is again a member of the class $S_{\mathrm{conv},k}$. But it fails to meet the intuition
behind clustering since it does not distinguish the data from the various clearly
different clusters.

Note that in the process of clustering, it is sufficient to find the set $J$ of
indices mentioned above. From this $J$ one can find for every data-item $x$ in the
set $\bigcup_{j \in J} A_j$ of all permitted data the unique cluster to which $x$ belongs. One just
enumerates the sets with the indices in $J$ until the data-item appears in one of
them and then uses the index of this set as a description for the cluster to which
this data-item belongs. So, from a recursion-theoretic point of view, finding the
set $J$ is the relevant part of a given clustering problem.
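The lookup described in this paragraph can be sketched by interleaving the enumerations of the sets indexed by J. The Python code below is an illustration under assumptions: the function name, the `enumerate_set` callback and the example class of residue classes modulo 3 are invented for this sketch.

```python
from itertools import count

def find_cluster(x, J, enumerate_set):
    """Return the index j in J whose set A_j contains the data-item x.

    enumerate_set(j) must return a fresh iterator enumerating A_j, which may
    be infinite. The search terminates only if x occurs in some A_j with j in J.
    """
    streams = {j: enumerate_set(j) for j in J}
    while streams:
        # One enumeration step per cluster and round, so no infinite set
        # can starve the others.
        for j in list(streams):
            try:
                if next(streams[j]) == x:
                    return j
            except StopIteration:
                del streams[j]  # finite set exhausted without meeting x
    raise ValueError("x does not occur in any of the listed clusters")

def residue_class(j):
    """Hypothetical numbering: A_j = {n : n mod 3 == j} for j in {0, 1, 2}."""
    return count(j, 3)

cluster_of_7 = find_cluster(7, [0, 1, 2], residue_class)
```

Since the clusters indexed by J are pairwise disjoint, the returned index is unique whenever x belongs to their union.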

For every indexed class of recursively enumerable sets there is a canonical
translation from these indices to type-0 grammars in the Chomsky hierarchy
which generate the corresponding sets. This links the current setting of clustering
to grammatical inference although there is no need herein to exploit the detailed
structure of the grammars obtained by such a translation.

1. A class has the Finite Containment Property iff any finite union of its members 
contains only finitely many other members. In Section 5 it is shown that classes 
satisfying this natural property separate the basic notions of clusterability, semi- 
clusterability and learnability. There is no purely topological characterization of 
clusterable classes: if a class contains an infinite set C and all singleton sets dis- 
joint from C then the class is clusterable iff C is recursive. Proposition 6.1 gives a 
further characterization which depends on the numbering: a class of disjoint sets 
is clusterable iff it has a numbering where every set occurs only finitely often. 
Section 6 provides some further sufficient criteria for clusterability which take 
into account topological aspects as well as properties of the given hypothesis 
space. These criteria are refinements of the Finite Containment Property. 

2. Clusterable classes are learnable but not vice versa. Although clusterable clas- 
ses are by definition uniformly recursively enumerable, the set of clustering tasks 
might fail to be. Proposition 3.2 shows that a class that can be clustered using a 
class comprising hypothesis space, that is a hypothesis space which enumerates 
the members of a superclass, can be clustered using a hypothesis space which 
enumerates the members of the class only. But by Example 3.3 a clusterable 
class might not be clusterable with respect to some class comprising hypothesis 
space. 

3. In Sections 7 and 8 it is demonstrated how one can map down concrete 
examples into this general framework. These concrete examples are geometrically 
defined subsets of Q k : affine sets, classes of sets with distinct accumulation points 
and convex hulls of finite sets. This third example is not clusterable but it turns 
out to be clusterable if some additional information about the task given to the 
clusterer is revealed. While there are several natural candidates for the additional 
information in the case of convex hulls of finite sets, this approach becomes much 
more difficult when dealing with clusters of other shapes. In the case of polygons 
in the 2-dimensional space, the additional information provided can consist of the 
number of clusters plus the overall number of vertices in the polygons considered. 
Still this additional information is insufficient for clustering classes of geometrical
objects some of which have holes. But the $k$-dimensional area is a sufficient
additional information as long as one rules out that the symmetric difference of
two clusters has the $k$-dimensional area 0.

4. Oracles are a way to distinguish between topological and computational difficulty
of a clustering problem. In Section 4 the relationship between an oracle
$E$ and the classes clusterable relative to $E$ is investigated. For example, every
1-generic oracle $E$ which is Turing reducible to the halting problem is trivial:
every class which is clusterable relative to $E$ is already clusterable without any
oracle. On the other hand, some classes are not even clusterable relative to any
oracle. Proposition 4.2 characterizes the maximal oracles which permit clustering
any class which is clusterable relative to some oracle; in particular it is shown
that such oracles exist.

2 The Basic Model

Most of the notation follows [11, 16]. The next paragraph summarizes the most 
important notions used in the present work. 

Basic Notation 2.1. A class $S$ is assumed to consist of recursively enumerable
subsets of a countable underlying set $U$ where in Sections 3-6, $U$ is the set of
natural numbers $\mathbb{N}$ and in Sections 7 and 8, $U$ is a rational vector space of
finite positive dimension. Mostly, $S$ is even required to be uniformly recursively
enumerable which means that there is a sequence $A_0, A_1, \ldots$ of subsets of $U$ such
that $S = \{A_0, A_1, \ldots\}$ and $\{(i,x) \in \mathbb{N} \times U : x \in A_i\}$ is recursively enumerable.
Such a sequence $A_0, A_1, \ldots$ is called a numbering for $S$.

The letters $I, J, H$ always range over finite subsets of $\mathbb{N}$. Define $A_I$ as $\bigcup_{i \in I} A_i$.
Let $\mathrm{disj}(S)$ contain all finite sets $I$ such that $A_i \cap A_j = \emptyset$ for all different $i, j \in I$.
The sets in $\mathrm{disj}(S)$ are called clustering tasks.

For any set $A$ let $|A|$ be the cardinality of $A$. Let $A^*$ be the set of all finite
sequences of members of $A$ and $|\sigma|$ be the length of a string $\sigma \in A^*$.

A text for a nonempty set $A \subseteq U$ is any infinite sequence containing all
elements but no nonelements of $A$. Clusterers and semiclusterers are recursive
functions from $U^*$ to finite subsets of $\mathbb{N}$, learners are recursive functions from $U^*$
to $\mathbb{N}$.

The sequence $W_0, W_1, \ldots$ denotes an acceptable numbering of all recursively
enumerable sets and $W_e$ can be interpreted as the domain of the $e$-th partial-recursive
function $\varphi_e$. The set $K = \{e : e \in W_e\}$ is called the halting problem
and this notion can be generalized to computation relative to oracles: $A'$ is the
halting problem relative to $A$; in particular $K'$ is the halting problem relative
to $K$ and $K''$ the one relative to $K'$. For more information on iterated halting
problems see [16, page 450].

Definition 2.2. A class $S = \{A_0, A_1, \ldots\}$ of clusters is called clusterable iff there
is a clusterer $M$ which, for every $I \in \mathrm{disj}(S)$, converges on every text for $A_I$ to
a $J \in \mathrm{disj}(S)$ with $A_J = A_I$. Such an $M$ is called a clusterer for $S$.

$S$ is called semiclusterable if one replaces $A_J = A_I$ by the weaker condition
that $A_J \supseteq A_I$.

$S$ is called learnable in the limit from positive data with respect to the hypothesis
space $A_0, A_1, \ldots$ iff there is a learner $M$ which for every $i \in \mathbb{N}$ converges on
every text for $A_i$ to a $j \in \mathbb{N}$ with $A_j = A_i$. In the following "learnable" stands for
"learnable in the limit from positive data with respect to the hypothesis space
$A_0, A_1, \ldots$".

Remark 2.3. A clusterer $M$ for $S = \{A_0, A_1, \ldots\}$ might also use a different
hypothesis space instead of the default one. Here a numbering $B_0, B_1, \ldots$ is
called the hypothesis space of $M$ iff for every clustering task $I$ and any text
for $A_I$, $M$ converges on this text to a finite set $J$ such that $B_J = A_I$ and
$B_i \cap B_j = \emptyset$ for all different $i, j \in J$. The hypothesis space is class preserving
if $S = \{B_0, B_1, \ldots\}$ and class comprising if $S \subseteq \{B_0, B_1, \ldots\}$. Nevertheless, in
light of Proposition 3.2, it is assumed that a clusterer uses the default numbering
$A_0, A_1, \ldots$ as its hypothesis space unless explicitly stated otherwise.
3 Numberings and Clustering

The main topic of this section is to investigate the role of numberings in cluster- 
ing. A natural question is whether clustering is independent of the numbering 
chosen as the hypothesis space. Another important issue is the relationship be- 
tween numberings of the class of clusters and numberings of the class of finite 
disjoint unions of clusters. The latter, which represents the clustering tasks,
might not have a numbering despite the fact that the former does, as shown
in the next example. The class of sets representing the clustering tasks in this
example cannot be made recursively enumerable by changing the numbering of 
the class of clusters. 

Example 3.1. Let $A_0 = \{0\}$ and let, for every $i \in \mathbb{N}$ and $j \in \{1, 2\}$,

$$A_{2i+j} = \begin{cases} \{2i+j\} & \text{if } i \notin K; \\ \{0, 2i+j\} & \text{if } i \in K. \end{cases}$$

The class $S = \{A_0, A_1, \ldots\}$ is uniformly recursively enumerable but the class
$\{A_I : I \in \mathrm{disj}(S)\}$ is not since

$$i \notin K \Leftrightarrow (\exists I \in \mathrm{disj}(S))\, [\{2i+1, 2i+2\} \subseteq A_I].$$

This connection holds for all numberings of $S$ but fails for any numbering of the
superclass of all finite sets.
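
The equivalence in Example 3.1 can be traced on a toy instance. In the sketch below, `toy_K` is a hypothetical finite stand-in for the halting problem K (the real K is undecidable, so this only illustrates the bookkeeping), and the function names are illustrative. Since 2i+1 occurs only in the cluster with index 2i+1 and 2i+2 only in the cluster with index 2i+2, a clustering task covering both elements exists exactly when these two clusters are disjoint.

```python
def example_3_1_cluster(index, toy_K):
    """Return A_index from Example 3.1, with toy_K standing in for K."""
    if index == 0:
        return {0}
    i, j = divmod(index - 1, 2)   # index = 2i + (j + 1) with j in {0, 1}
    j += 1                        # now index = 2i + j with j in {1, 2}
    return {0, 2 * i + j} if i in toy_K else {2 * i + j}

def pair_is_task(i, toy_K):
    """True iff {2i+1, 2i+2} is a clustering task, i.e. the two clusters are disjoint."""
    a = example_3_1_cluster(2 * i + 1, toy_K)
    b = example_3_1_cluster(2 * i + 2, toy_K)
    return a.isdisjoint(b)

toy_K = {1, 3}   # hypothetical finite stand-in for K
```

Here `pair_is_task(i, toy_K)` holds exactly for those i outside `toy_K`, mirroring the displayed equivalence.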

Thus there are clusterable classes where the corresponding class of all clustering 
tasks does not have a numbering. Nevertheless, a fundamental result of de Jongh 
and Kanazawa [4] carries over to clustering: whenever a class is clusterable with 
respect to a class comprising hypothesis space then the class is also clusterable 
with respect to every class preserving hypothesis space. An application of the 
next result is that every uniformly recursively enumerable class consisting only 
of finite sets is clusterable. 

Proposition 3.2. Let $A_0, A_1, \ldots$ be a numbering of a class $S$ and $B_0, B_1, \ldots$ be
another numbering (of a possibly different class) such that for every $I \in \mathrm{disj}(S)$
there is a $J$ with $A_I = B_J$. If there is a clusterer for $S$ using the hypothesis
space $B_0, B_1, \ldots$ then there is another clusterer that uses the original numbering
$A_0, A_1, \ldots$ as its hypothesis space.

Example 3.3. The converse of Proposition 3.2 does not hold: there is a clus- 
terable class S and a numbering of a superclass of S such that no clusterer for 
S can use this numbering as a hypothesis space. 

Although there are classes $S = \{A_0, A_1, \ldots\}$ such that $\{A_I : I \in \mathrm{disj}(S)\}$ is
not uniformly recursively enumerable, the superclass $\{A_I : I \subseteq \mathbb{N} \wedge |I| \text{ is finite}\}$
is uniformly recursively enumerable. A clusterer is at the same time a learner
for $S$ using the hypothesis space given by the numbering $B_0, B_1, \ldots$ which sat-
isfies $B_{\mathrm{norm}(I)-1} = A_I$ for all nonempty sets $I$. But learnability of uniformly
recursively enumerable classes does not depend on the hypothesis space; follow-
ing a result of de Jongh and Kanazawa [4] there is also a learner for $S$ which
uses $A_0, A_1, \ldots$ as its hypothesis space. So every clusterable class is learnable
although the converse direction does not hold.

Property 3.4. Every clusterable class is learnable. 

Examples 3.5. (a) The class $S_{\mathrm{Gold}}$ consisting of $\mathbb{N}$ and all its finite subsets is
neither learnable nor clusterable. But $S_{\mathrm{Gold}}$ is semiclusterable.

(b) The class $S_{\mathrm{sing}}$ consisting of all singletons and the set $\mathbb{N}$ is learnable and
semiclusterable but not clusterable.

(c) Let $C$ be infinite and recursively enumerable. The class $S_C$ consisting of
$C$ and all singletons disjoint from $C$ is learnable. Furthermore, $S_C$ is clusterable
iff $S_C$ is semiclusterable iff $C$ is recursive.

The classes $S_{\mathrm{sing}}$ and $S_C$ where $C$ is nonrecursive are learnable but not cluster-
able. Both have the property that they are not closed under disjoint union. The
next result shows that this property is essential for getting examples which are
learnable but not clusterable.

Property 3.6. Let a class $S$ be closed under disjoint union, that is, $A \cup B \in S$
for all disjoint $A, B \in S$. Then $S$ is clusterable iff $S$ is learnable.

A learner $M$ for a class $S$ is called prudent if it only outputs indices of sets it
learns. One can enumerate all possible hypotheses $e_0, e_1, \ldots$ of $M$ and so obtain
a numbering $B_0, B_1, \ldots$ with $B_i = W_{e_i}$ of a learnable superclass of $S$. Fulk [8]
showed that every learnable class has a prudent learner. Therefore, it is sufficient
to consider only uniformly recursively enumerable classes for learning. So Fulk’s
result can be stated as follows.

Property 3.7 [11, Proposition 5.20]. Every learnable class has a prudent learner. 
In particular, every learnable class is contained in some learnable and uniformly 
recursively enumerable class. 

So every learnable class can be extended to one which is learnable and uniformly 
recursively enumerable. But in contrast to learning in the limit, this requirement 
turns out to be restrictive for clustering. Most interesting results are based on 
Definition 2.2 with the consequence that only countably many classes are cluster- 
able. The more general notion below expands the collection of clusterable classes 
to an uncountable one. Although the latter collection contains many irregular 
classes of limited interest, it still gives some fundamental insights. In this case
one uses the acceptable numbering $W_0, W_1, \ldots$ of all recursively enumerable sets
as the hypothesis space for the clusterer.

Definition 3.8. A class $S$ of recursively enumerable sets is clusterable in the
general sense iff there is a machine $M$ which converges on every text for the
union of finitely many disjoint sets $L_0, L_1, \ldots, L_n \in S$ to a finite set $J$ of indices
of pairwise disjoint members of $S$ such that $L_0 \cup L_1 \cup \ldots \cup L_n = \bigcup_{e \in J} W_e$.

Proposition 3.9. Let $F$ be a $\{0,1\}$-valued function which is not computable
relative to the oracle $K''$. For all $x, y \in \mathbb{N}$ and $z \in \{0,1\}$ let $A_{x,z} = \{(x,u,z) :
u \in \mathbb{N}\}$ and $B_{x,y} = \{(x,y,0), (x,y,1)\}$. Then the class $S$ containing all sets $A_{x,z}$
and $B_{x,y}$ with $x, y \in \mathbb{N}$ and $z = F(x)$ is clusterable in the general sense but not
contained in any clusterable class which is uniformly recursively enumerable.
4 Clustering and Oracles

Oracles are a method to measure the complexity of a problem. Some classes are
clusterable with a suitable oracle while others cannot be clustered with any ora-
cle. So the use of oracles makes it possible to distinguish classes which are unclusterable
because of their computational difficulty from those which are unclusterable
for topological reasons. This is illustrated in the following remark.

Remark 4.1. Recall the classes $S_C$ and $S_{\mathrm{Gold}}$ from Examples 3.5. The class $S_C$
is clusterable iff the set $C$ in its definition is recursive. It is easy to see that
supplying $C$ as an oracle to the clusterer resolves all computational problems in
the case that $C$ is not recursive. But the class $S_{\mathrm{Gold}}$ is unclusterable because of
its topological structure and remains unclusterable relative to every oracle.

Oracles have been extensively studied in the context of inductive inference [1, 
6, 12, 14]. Call an oracle E maximal for clustering if every uniformly recursively 
enumerable class which is clusterable relative to some oracle is already clusterable 
relative to E. Call an oracle E trivial for clustering if every uniformly recursively 
enumerable class which is clusterable relative to E is already clusterable without 
any oracle. 

The next result shows that in contrast to the case of clustering in the general 
sense there are maximal oracles for clustering. It turns out that for an oracle E 
below K the following three conditions are equivalent: E is trivial for clustering, 
E is trivial for learning sets, E is trivial for learning functions; see [6] for the
equivalence of the last two statements. 

Proposition 4.2. For every oracle E the following statements are equivalent: 

1. $E \geq_T K$ and $E' \geq_T K''$;

2. the oracle E is maximal for learning from positive data - every uniformly re- 
cursively enumerable class is either not learnable with any oracle or learnable 
with oracle E; 

3. the oracle E is maximal for clustering - every uniformly recursively enu- 
merable class is either not clusterable with any oracle or clusterable with 
oracle E. 

An oracle $G$ is $k$-generic if for every $\Sigma_k$-set $T$ of strings there is a prefix $\eta \preceq G$
such that either $\eta \in T$ or $\vartheta \notin T$ for all $\vartheta \succeq \eta$. There are 1-generic sets but no
2-generic sets below $K$. Nevertheless, $k$-generic sets exist for all $k \in \{1, 2, \ldots\}$.

Proposition 4.3. Let $E$ be a nonrecursive oracle with $E \leq_T K$.

1. If $E$ has 1-generic degree then $E$ is trivial, that is, it only permits clustering classes
which can already be clustered without any oracle.

2. If E does not have 1-generic degree then there is a uniformly recursively 
enumerable class which can be clustered using the oracle E but not without 
any oracle. 

The same characterizations hold for learning in place of clustering. 

The following example shows that trivial oracles can also be incomparable to $K$.

Example 4.4. Every 2-generic oracle is trivial for clustering.
5 The Finite Containment Property

The main topic of this section is to investigate the relationship between the 
topological structure of the class S and the question whether S is clusterable. 
Recall that the classes $S_{\mathrm{Gold}}$ and $S_{\mathrm{sing}}$ are not clusterable for topological reasons:
they contain a cluster which is the disjoint infinite union of some other clusters. 
So one might impose the following natural condition in order to overcome this 
problem. 

Definition 5.1. A class $S = \{A_0, A_1, \ldots\}$ has the Finite Containment Property
if every finite union of clusters contains only finitely many clusters. That is, for
all $i$ there are only finitely many sets $B \in S$ with $B \subseteq A_{\{0,1,\ldots,i\}}$.

Note that the Finite Containment Property is not necessary for clusterability. 
The class $\{\{i, i+1, \ldots\} : i \in \mathbb{N}\}$ is learnable and clusterable but does not satisfy
the Finite Containment Property. 

It is easy to see that the Finite Containment Property implies Angluin’s
condition: for every set $A_I$ there are only finitely many sets $A_J$ with $A_J \subset A_I$. If
one takes $D$ to be a set which contains for each $A_J$ with $A_J \subset A_I$ the minimum
of $A_I - A_J$, then $D$ is finite and there is no $A_J$ left with $D \subseteq A_J \subset A_I$.
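
The construction of the set D can be traced on finite data. The sketch below assumes that the target set and its finitely many proper subsets from the class are given as finite Python sets; the function name `tell_tale` is illustrative and not taken from the paper.

```python
def tell_tale(target, proper_subsets):
    """For each proper subset A_J of the target A_I, pick the least element of A_I - A_J."""
    return {min(target - subset) for subset in proper_subsets}

# Toy data (an assumption for illustration only).
target = {0, 1, 2, 3}
proper_subsets = [{0, 1}, {2, 3}, {0, 1, 2}]
D = tell_tale(target, proper_subsets)
```

In this toy case D is {0, 2, 3}, and none of the listed proper subsets contains all of D, which is the property used in the argument above.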

Property 5.2. If $S = \{A_0, A_1, \ldots\}$ has the Finite Containment Property then
$S$ is clusterable relative to every oracle $E$ with $E \geq_T K$ and $E' \geq_T K''$.

Although the Finite Containment Property guarantees clusterability from the 
topological point of view, it fails to guarantee clusterability from the recursion- 
theoretic point of view. Instead there are classes satisfying the Finite Contain- 
ment Property which are clusterable only with the help of a maximal oracle. 

Property 5.3. There is a class satisfying the Finite Containment Property 
which is learnable but neither clusterable nor semiclusterable. Furthermore, clus- 
terability cannot be characterized in topological terms only. 

Proposition 5.4. Every class has a semiclusterer using the halting problem as
an oracle. Furthermore, a class $S = \{A_1, A_2, \ldots\}$ is semiclusterable without any
oracle if the representation of the class is a uniformly recursive family, that is,
if $\{(i,x) \in \mathbb{N}^2 : x \in A_i\}$ is recursive and not only recursively enumerable.

By Property 5.3 one can separate learnability from clusterability and semiclus- 
terability by a class satisfying the Finite Containment Property. The next results 
show that there are no implications between the notions of learnability, cluster- 
ability and semiclusterability except the following two: “clusterable => semiclus- 
terable” and “clusterable => learnable”. All nonimplications are witnessed by 
classes satisfying the Finite Containment Property. 

Proposition 5.5. Let the class $S$ consist of the clusters $A_{3i} = \{(i,x) : x \in \mathbb{N}\}$
and $A_{3i+j} = \{(i,x) : x \equiv j+1 \pmod 2 \text{ and } x < 2 + |W_i|\}$ where $i \in \mathbb{N}$ and
$j \in \{1,2\}$. The class $S$ satisfies the Finite Containment Property. Furthermore,
$S$ is semiclusterable and learnable but not clusterable.
Example 5.6. Assume that $W_0 = \mathbb{N}$. The class $\{\{(i,x) : x \in W_j\} : j < i\}$ is
neither learnable nor clusterable. But it is semiclusterable and satisfies the Finite
Containment Property.

6 Numbering-Based Properties 

Every uniformly recursively enumerable class of pairwise disjoint sets is learn-
able: the learner just waits until it finds $x \in \mathrm{range}(\sigma)$ and $i < |\sigma|$ such that
$x$ is enumerated into $A_i$ within $|\sigma|$ steps; from then on the learner outputs the
index $i$. But for nonrecursive sets $C$, the class $S_C$ witnesses that such a class
need not be clusterable. So one has to consider not only properties of the class but also
properties of some of its numberings. A class $\{A_0, A_1, \ldots\}$ has the Numbering-
Based Finite Containment Property if for every $I$ there are only finitely many $j$
with $A_j \subseteq A_I$.
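
The learner described at the start of this section can be sketched as follows. The sketch assumes a hypothetical function `enumerated_within(i, steps)` that returns the finite set of elements enumerated into the i-th set within the given number of steps; both this interface and the toy enumerator are assumptions made for illustration. The prefixes are replayed in order so that, once an index has been found, longer inputs keep producing the same index.

```python
def learner(sigma, enumerated_within):
    """Index conjectured after reading the finite sequence sigma, or None while waiting."""
    for n in range(1, len(sigma) + 1):       # replay the prefixes of sigma
        prefix = sigma[:n]
        for i in range(n):                   # candidate indices i < |prefix|
            if any(x in enumerated_within(i, n) for x in prefix):
                return i
    return None

def enumerated_within(i, steps):
    """Toy enumerator (an assumption): A_i = {10i, ..., 10i+9}, one new element per step."""
    return {10 * i + d for d in range(min(steps, 10))}
```

On the sequence 25, 21, 20 the sketch keeps waiting for the first two inputs and then settles on the index 2, since 21 is enumerated into the set with index 2 within three steps.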

Proposition 6.1. A class of pairwise disjoint sets has the Numbering-Based 
Finite Containment Property iff it is clusterable. 

The class given in Example 5.6 satisfies the Numbering-Based Finite Contain- 
ment Property but is not clusterable. The next two results present more restric- 
tive sufficient conditions. 

Proposition 6.2. Assume that $A_i \not\subseteq \bigcup_{j \neq i} A_j$ for all $i$ and that it is decidable
whether two sets $A_i, A_j$ intersect. Then $S = \{A_0, A_1, \ldots\}$ is clusterable.

Proposition 6.3. Assume that $S = \{A_0, A_1, \ldots\}$ satisfies the following three
conditions:

1. every $A_i$ is infinite;

2. if $i \neq j$ then $A_i \cap A_j$ is finite;

3. $S$ is uniformly recursive, that is, $\{(i,x) : x \in A_i\}$ is recursive.

Then S is clusterable. But no two of these three conditions are sufficient for 
being clusterable. 

7 Geometric Examples 

The major topic of the last two sections is to look at sets of clusters which are
characterized by basic geometric properties. Therefore the underlying set is no
longer $\mathbb{N}$ but the $k$-dimensional rational vector space $\mathbb{Q}^k$, where $k \in \{1, 2, \ldots\}$
is fixed. The classes considered consist of natural subsets of $\mathbb{Q}^k$. Except for the
class $S_{\mathrm{accu},k}$ in Proposition 7.2 below, the following holds: the clusters are built
from finitely many parameter-points in $\mathbb{Q}^k$; the clusters are connected sets; every
task consists of clusters having a positive distance from each other. So there is
a unique natural way of breaking down a task into clusters.

Recall that a subset $U \subseteq \mathbb{Q}^k$ is affine iff for every fixed $x \in U$ the set
$V = \{y \in \mathbb{Q}^k : x + y \in U\}$ is a rational vector space, that is, closed under
scalar multiplication and addition. The dimension of $U$ is the dimension of $V$ as
a vector space.

Example 7.1. Let $S_{\mathrm{aff},k}$ be the class of all affine subspaces of $\mathbb{Q}^k$ which have
dimension $k-1$. The class $S_{\mathrm{aff},k}$ is clusterable but the class $S_{\mathrm{aff},k} \cup \{\mathbb{Q}^k\}$ is not.

Proposition 7.2. Let $k$ be a positive natural number and $S_{\mathrm{accu},k}$ be a class
$\{A_0, A_1, \ldots\}$ of bounded subsets of $\mathbb{Q}^k$ for which there is a recursive and one-
one sequence $a_0, a_1, \ldots$ of points in $\mathbb{Q}^k$ satisfying the following: (1) every $A_i$ has
exactly one accumulation point which is $a_i$; (2) no accumulation point of the set
$\{a_0, a_1, \ldots\}$ is contained in this set. Then the class $S_{\mathrm{accu},k}$ is clusterable.

8 Clustering with Additional Information 

Freivalds and Wiehagen [7] introduced a learning model where the learner re-
ceives, in addition to the graph of the function to be learned, an upper bound on
the size of some program for this function - this additional information increases
the learning power and makes it possible to learn the class of all recursive functions.

Similarly, a machine receiving adequate additional information can solve ev-
ery clustering task for the class $S_{\mathrm{conv},k}$ defined below. But without that addi-
tional information, $S_{\mathrm{conv},k}$ is not clusterable. So the main goal of this section is to
determine which pieces of additional information are sufficient to cluster certain
geometrically defined classes where clustering without additional information is
impossible.

Definition 8.1. For a given positive natural number $k$, the class $S_{\mathrm{conv},k}$ contains
all subsets of $\mathbb{Q}^k$ which are the rational points in the convex hull of a finite subset
of $\mathbb{Q}^k$.

Proposition 8.2. The class $S_{\mathrm{conv},k}$ is semiclusterable but not clusterable.

Proposition 8.3. The class $S_{\mathrm{conv},k}$ is clusterable with additional information if
for any task $I$ one of the following pieces of information is also provided to the
machine $M$:

- the number $|I|$ of clusters of the task;

- a positive lower bound for $\min(\{1\} \cup \{d(A_i, A_j) : i, j \in I \wedge i \neq j\})$;

- the minimal number of points which are needed to generate all the convex
sets $A_i$ with $i \in I$.
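
For k = 1 the first kind of additional information can be illustrated by a simple heuristic. The sketch below is an assumption-laden illustration and not the construction behind Proposition 8.3: clusters are closed rational intervals, the hypothetical function `cluster_intervals` receives a finite prefix of the text together with the number of clusters, and it outputs interval endpoints rather than indices in a numbering. It sorts the points seen so far and splits them at the largest gaps. On a text for pairwise disjoint intervals the conjecture should stabilise once all endpoints have appeared and the gaps inside each cluster have become smaller than the gaps between clusters.

```python
from fractions import Fraction

def cluster_intervals(prefix, num_clusters):
    """Conjecture num_clusters intervals (lo, hi) from the points seen so far.

    Returns None while fewer than num_clusters distinct points have been seen.
    """
    points = sorted(set(prefix))
    if len(points) < num_clusters:
        return None
    # Positions of the num_clusters - 1 largest gaps between neighbouring points.
    gaps = sorted(range(len(points) - 1),
                  key=lambda g: points[g + 1] - points[g],
                  reverse=True)[:num_clusters - 1]
    intervals, start = [], 0
    for g in sorted(gaps):
        intervals.append((points[start], points[g]))
        start = g + 1
    intervals.append((points[start], points[-1]))
    return intervals

prefix = [Fraction(0), Fraction(1), Fraction(1, 2),
          Fraction(5), Fraction(6), Fraction(11, 2)]
```

For this prefix and two clusters the sketch conjectures the intervals from 0 to 1 and from 5 to 6.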

The last results of the present work deal with conditions under which nonconvex
geometrical objects can be clustered. The first approach is to look at unions of
convex objects which are still connected. For $k = 1$, this class is the same as
$S_{\mathrm{conv},1}$. But for $k = 2$, this class is larger. There the type of additional informa-
tion used for clustering $S_{\mathrm{conv},k}$ is no longer sufficient. Given both the number of
clusters and the number of vertices as additional information, it is possible to
cluster the natural subclass $S_{\mathrm{polygon},2}$ of all classes considered. But if one permits
holes inside the clusters, this additional information is no longer sufficient. An
alternative parameter is the $k$-dimensional area covered by a geometric object.
In Example 8.7 a natural class $S_{\mathrm{area},k}$ is introduced which can be clustered with
the area of a clustering task given as additional information. The class $S_{\mathrm{area},2}$
contains $S_{\mathrm{polygon},2}$ and the class from Example 8.6 as subclasses.

Definition 8.4. A polygon is given by $n$ vertices $q_1, q_2, \ldots, q_n \in \mathbb{Q}^2$ and is the
union of $n$ sides which are the convex hulls of $\{q_1, q_2\}, \{q_2, q_3\}, \ldots, \{q_n, q_1\}$. The
sides do not cross each other and every vertex is contained in exactly two sides. Every
side has positive length and the angle between the two sides meeting at a vertex
is never 0, 180 or 360 degrees. Let $p_0, p_1, \ldots$ be an enumeration of the polygons
and let $P_i$ be the set of all points in $\mathbb{Q}^2$ which are on the polygon $p_i$ or in its
interior. Let $n_i$ denote the minimum number of vertices to define the polygon $p_i$
and $S_{\mathrm{polygon},2}$ be the class $\{P_0, P_1, \ldots\}$.

Proposition 8.5. The class $S_{\mathrm{polygon},2} = \{P_0, P_1, \ldots\}$ is clusterable with addi-
tional information in the sense that it is clusterable from the following input
provided to a clusterer for task $I$ in addition to a text for $P_I$: the cardinality $|I|$
and the number $\sum_{i \in I} n_i$. Clustering is impossible if only one of these two pieces
of information is available.

Example 8.6. Let $B_{i,j} = p_j \cup (P_i - P_j)$ and $m_{i,j} = n_i + n_j$ if $P_j \subseteq P_i - p_i$;
otherwise let $B_{i,j} = P_i$ and $m_{i,j} = n_i$. Let $S_{\mathrm{hole},2}$ consist of all sets $B_{i,j}$. Then
it is impossible to cluster $S_{\mathrm{hole},2}$ if besides a text the only pieces of additional
information supplied are $|I|$ and $\sum_{(i,j) \in I} m_{i,j}$.

Example 8.7. Let $S_{\mathrm{area},k} = \{A_0, A_1, \ldots\}$ be the class of finite unions of members
of $S_{\mathrm{conv},k}$ which are connected and have a positive $k$-dimensional area. Without
loss of generality the set $\{(i,x) : x \in A_i\}$ and the function mapping $i$ to the
area of $A_i$ are recursive. Then there is a clusterer for $S_{\mathrm{area},k}$ which uses the
area of the members of a cluster as additional information. But $S_{\mathrm{area},k}$ cannot
be clustered without additional information.



9 Conclusion 

Clustering is a process which makes important use of prior assumptions. Indeed, 
not every set of points in an underlying space is a potential cluster; geometric 
conditions for instance play an important role in the definition of the class of 
admissible clusters. Whereas such conditions have been taken into account in 
previous studies, none of those has investigated the consequences of the more 
fundamental requirement that clustering is a computable process. This paper 
shows that recursion-theoretic and geometric conditions can both yield substan-
tial insights into whether or not clustering is possible. It also explores to what
extent clustering depends on computational properties, by characterizing the
power of oracles for clustering. It is expected that further studies of the inter-
action between topological, recursion-theoretic and geometrical properties will
turn out to be fruitful.
Abstract. Grammatical inference consists in learning formal grammars
for unknown languages when given sequential learning data. Classically
this data is raw: strings that belong to the language and possibly
strings that do not. In this paper, we present a generic setting in which
domain and typing background knowledge can be expressed. Algorithmic so-
lutions are provided to introduce this additional information efficiently
in the classical state-merging automata learning framework. The improve-
ment induced by the use of this background knowledge is shown on both
artificial and real data.

Keywords: Automata Inference, Background Knowledge. 

Toward Grammatical Inference with Background Knowledge. Gram-
matical inference consists in learning formal grammars for unknown languages
when given sequential learning data. Classically this data is raw: strings that
belong to the language and possibly strings that do not. If no extra hypothesis
is made, state merging algorithms have been shown to be successful for the case of
deterministic finite automata (Dfa) learning. One of the most representative
and simplest algorithms in this family is Rpni [14]. Convergence of Dfa learning
algorithms depends on the presence of characteristic elements in the learning
sets. In real applications some of these elements may be missing, but this could be
compensated for by other sources of information. For instance, if the learner is al-
lowed to ask questions about the unknown language, the setting is that of active
learning: in the case of learning Dfa, algorithm L* [1] has been proposed. In the
absence of a source of reliable counter-examples, the class of learnable languages
is more restricted. Either subclasses of the regular languages are used, or statisti-
cal regularities can be used to help the inference. Then, with the hypothesis that
not only the language but also the distribution is regular, one can hope to learn
a stochastic deterministic finite state automaton, as with algorithm Alergia [4].

In other areas of machine learning, successful techniques have been invented
to learn with more additional information and especially with the
capacity of including expert knowledge: a typical example is that of inductive
logic programming (Ilp) [13]. Ilp is concerned with learning logic programs from
data that is also presented as facts. Furthermore, some background knowledge can
be presented to the system using the same representation language (first-order
logic) as the data or the program to be learned.

In grammatical inference, specific knowledge relative to the application is
often included by ad hoc modification of the algorithm (see for instance [10]).
Very few automata inference algorithms allow the user to express a bias to guide
the search. In this article, we propose a method for integrating two kinds
of background knowledge into inference: first, domain bias, which is knowledge
about the language recognised by the target automaton, and second, typing bias, which
considers semantic knowledge embedded in the structure of the automaton.

Domain and Typing Background Knowledge. 

Domain bias is often available and may sometimes compensate for the lack of counter-
examples. In [15], domain information may be provided for learning subsequen-
tial transducers. The domain is the set of sequences such that translating them
makes sense, i.e. those sequences that belong to the original language model. If
the exact domain is given, (partial) subsequential transducers may be identified
in the limit. In classification tasks performed by automata, the domain may be
the set of sequences of interest to be classified inside or outside the concept to
learn. Take for instance a task involving well-formed boolean expressions: the
sequence “¬)” should not appear. But it will appear neither in the correctly la-
belled examples (those that evaluate to true), nor in those that evaluate
to false. If - as would be the case with a typical state merging algorithm - one
depends on the presence of a counter-example containing “¬)” in the charac-
teristic set for identification to be possible, then there would be no hope
of identification. If on the other hand we can express as domain background knowl-
edge that no string can contain “¬)” as a substring, then identification could
take place even when some counter-example cannot be given due to the intrinsic
properties of the problem. We propose in section 2 an algorithm which makes it possible to
introduce this kind of background knowledge into automata inference.

Typing Bias: In [10], the authors add types to the states of the inferred automa-
ton so as to introduce semantics on the symbols of the alphabet. These types
are then used to constrain the structure of the inferred automaton, thereby
embedding the type semantics in its structure. In [11], two of the authors proposed
a formalism to represent the typing information. However, in this framework, the
conditions to be met by the typing information are restrictive. Section 3 proposes
an algorithm that removes the limitations of this previous framework.

The remainder of the paper is structured as follows: First we review the 
basics of automata inference by state merging methods. Then, we apply the 
ideas presented in [5] to take specifically into account some domain bias in the 
inference process. The same algorithmic ideas are then applied for tackling more 
expressive typing functions than in [11]. We end this study with two particular
cases allowing a more efficient treatment: We revisit the results of [11] under 
the formulation of the specialisation of a (type) automaton and we study the 
practical case when the sole information available is a typing of the sequences. 
1 State Merging Inference of Automata

Languages and Automata. An alphabet $\Sigma$ is a finite nonempty set of sym-
bols. $\Sigma^*$ denotes the set of all finite strings over $\Sigma$. A language $L$ over $\Sigma$ is
a subset of $\Sigma^*$. In the following, unless stated otherwise, symbols are indi-
cated by $a, b, c, \ldots$, strings by $u, v, \ldots$, and the empty string by $\lambda$. A finite
state automaton (Fsa) $A$ is a tuple $\langle \Sigma, Q, Q_0, F, \delta \rangle$ where $\Sigma$ is an alphabet,
$Q$ is a finite set of states, $Q_0 \subseteq Q$ is the set of initial states, $F \subseteq Q$ is the
set of final states, $\delta : Q \times \Sigma \to 2^Q$ is the transition function. The transi-
tion function $\delta$ is classically extended to sequences by: $\forall q \in Q$, $\delta(q, \lambda) = \{q\}$,
$\forall q \in Q, \forall w \in \Sigma^*, \forall a \in \Sigma$, $\delta(q, aw) = \bigcup_{q' \in \delta(q,a)} \delta(q', w)$. The language $L(A)$
recognized by $A$ is $\{w \in \Sigma^* : \exists q_0 \in Q_0, \delta(q_0, w) \cap F \neq \emptyset\}$. Languages recog-
nized by Fsa are called regular. For a given state $q$, the prefix and the suf-
fix sets of $q$ are respectively $P(q) = \{w \in \Sigma^* : \exists q_0 \in Q_0, q \in \delta(q_0, w)\}$ and
$S(q) = \{w \in \Sigma^* : \exists q_f \in F, q_f \in \delta(q, w)\}$. The above definitions allow Fsa to
have inaccessible states or useless states (from which no string can be parsed)
but we will not consider such automata in this paper, i.e. in the sequel, we
will assume that $\forall q \in Q, P(q) \neq \emptyset$ and $S(q) \neq \emptyset$. A deterministic finite au-
tomaton (Dfa) is an Fsa verifying $|Q_0| = 1$ and $\forall q \in Q, \forall a \in \Sigma, |\delta(q, a)| \leq 1$.
An automaton is unambiguous if there exists at most one accepting path for
each sequence. Dfa are trivially unambiguous. For each regular language $L$, the
canonical automaton of $L$ is the smallest Dfa accepting $L$; it is denoted $A(L)$.
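
The definitions above translate directly into a small data structure. The following is a minimal sketch; the class name `FSA`, its method names, and the choice of a dictionary for the transition function are assumptions made for illustration and are not part of the paper.

```python
class FSA:
    """Finite state automaton <Sigma, Q, Q0, F, delta> with delta: Q x Sigma -> 2^Q."""

    def __init__(self, alphabet, states, initial, final, delta):
        self.alphabet = set(alphabet)
        self.states = set(states)
        self.initial = set(initial)
        self.final = set(final)
        self.delta = delta               # dict: (state, symbol) -> set of states

    def step(self, q, a):
        """delta(q, a); missing entries mean the empty set."""
        return self.delta.get((q, a), set())

    def run(self, q, w):
        """Extended transition function delta(q, w), computed symbol by symbol."""
        current = {q}
        for a in w:
            current = set().union(*(self.step(p, a) for p in current))
        return current

    def accepts(self, w):
        """w is in L(A) iff some initial state reaches a final state on w."""
        return any(self.run(q0, w) & self.final for q0 in self.initial)

    def is_deterministic(self):
        """|Q0| = 1 and |delta(q, a)| <= 1 for all states and symbols."""
        return len(self.initial) == 1 and all(
            len(self.step(q, a)) <= 1 for q in self.states for a in self.alphabet)

# Toy automaton (an assumption): strings over {a, b} that end in "ab".
ends_in_ab = FSA("ab", {0, 1, 2}, {0}, {2},
                 {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}})
```

The toy automaton accepts "aab", rejects "aba", and is not deterministic because state 0 has two successors on the symbol a.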

State Merging Algorithm. Regular languages form the sole class in the 
Chomsky Hierarchy to be efficiently identifiable from given data [7]. The most 
popular approach is the state merging scheme used to learn Fsa but also ap- 
plied to the inference of sub-sequential transducers, probabilistic automata and 
more expressive grammars. This approach relies on the state merging operation 
which consists in unifying the states (or for grammars, the non-terminals). This 
operation increases the number of accepted sequences and is thus a generali-
sation operation. The general framework of state merging inference is given by
algorithm 1. It consists in first building from the set $I_+$ of example strings a
maximal canonical automaton, in short Mca, that recognizes only these strings,
and then in applying the state merging operation iteratively while preserving
a “compatibility” property used to avoid over-generalisation. When a set $I_-$ of
counter-example strings is available, the compatibility condition prevents over-
generalisation by rejecting automata accepting at least one string of $I_-$. The
search may be restricted to deterministic (respectively unambiguous) automata
by using a merging for determinisation (respectively for disambiguation) op-
eration after each merge() [9,6].

Two functions are likely to be modified to introduce an application-based
bias. The function choose_states() makes it possible to introduce search heuristics while
the function compatible() controls the generalization and restricts the search
space. In this article, we will focus on the latter to integrate background knowl-
edge.
Algorithm 1 Generic State Merging Algorithm
Require: training set $I_+$ (set of example strings)
Ensure: $A = \langle \Sigma, Q, F, \delta, Q_0 \rangle$ is compatible
  $A \leftarrow$ Mca($I_+$)
  while $(q_1, q_2) \leftarrow$ choose_states($Q$) do
    $A' \leftarrow$ merge($A$, $(q_1, q_2)$)
    if compatible($A'$) then $A \leftarrow A'$
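
The generic scheme can be instantiated in a few lines. The sketch below is one possible greedy instance under stated assumptions, not Rpni and not the authors' implementation: an automaton is a triple of initial states, final states and transition triples, the choice of states simply tries all pairs in sorted order, compatibility means rejecting every counter-example, and no merging for determinisation is performed. All function names are illustrative.

```python
from itertools import combinations

def mca(positive):
    """Maximal canonical automaton: one separate path per example string."""
    initial, final, trans, fresh = set(), set(), set(), 0
    for w in positive:
        q, fresh = fresh, fresh + 1
        initial.add(q)
        for a in w:
            trans.add((q, a, fresh))
            q, fresh = fresh, fresh + 1
        final.add(q)
    return initial, final, trans

def merge(automaton, q1, q2):
    """Unify q2 with q1 by renaming q2 everywhere."""
    initial, final, trans = automaton
    def rename(q):
        return q1 if q == q2 else q
    return ({rename(q) for q in initial}, {rename(q) for q in final},
            {(rename(p), a, rename(q)) for p, a, q in trans})

def accepts(automaton, w):
    initial, final, trans = automaton
    current = set(initial)
    for a in w:
        current = {q for p, b, q in trans if p in current and b == a}
    return bool(current & final)

def states(automaton):
    initial, final, trans = automaton
    return initial | final | {p for p, _, _ in trans} | {q for _, _, q in trans}

def state_merging(positive, negative):
    """Merge pairs of states as long as no counter-example becomes accepted."""
    automaton = mca(positive)
    changed = True
    while changed:
        changed = False
        for q1, q2 in combinations(sorted(states(automaton)), 2):   # choose_states
            candidate = merge(automaton, q1, q2)
            if not any(accepts(candidate, w) for w in negative):     # compatible
                automaton, changed = candidate, True
                break
    return automaton
```

Each successful merge removes one state, so the loop terminates. Because merging only adds accepted strings, the result still accepts every example, and the compatibility test keeps every counter-example rejected.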



2 Using Domain Bias 

A first natural way to give information helping the inference is to force the hy-
pothesis language, denoted hereafter by $L$, to be included in a more general one,
denoted by $L_G$. In practice, this more general language may be defined according
to the knowledge of the maximal domain of the sequences such that classifying
them (as being positive or negative) makes sense. One can also consider that
this general language is simply an over-general hypothesis, obtained from a ma-
chine learning system or from an expert, which has to be specialized. Since
$L \subseteq L_G \Leftrightarrow L \cap (\Sigma^* - L_G) = \emptyset$, learning a language included in $L_G$ may also be
seen as the inference of a language given a (possibly) infinite set of counter-
examples defined by a language $L_- = \Sigma^* - L_G$ instead of (or in addition to) the
traditional finite set $I_-$. Obviously these three interpretations ($L_G$, $L_-$ and $I_-$)
are not exclusive and may be combined to obtain a more general language $L_G$
from the expert knowledge on each of these aspects.

The boolean formulae example given in the introduction and a known constraint
on the lengths of the strings of the target language are both instances of
domain bias. We give in this section a practical algorithm to ensure during the
inference that the hypothesis language $L$ remains included in $L_G$. More
precisely, we take the equivalent “counter-example language” point of view,
i.e. we assume that we are given an automaton $A_-$ such that
$L(A_-) = L_- = \Sigma^* - L_G$, and we propose to ensure that
$L \cap L_- = \emptyset$. If we are given the (deterministic) canonical
automaton $A(L_G)$, $A_-$ can easily be computed by completing $A(L_G)$ and
then swapping final and non-final states. Note that no assumption of
determinism is required on $A_-$, which allows a compact representation of
$L_-$, if available. In particular, $A_-$ can easily be defined as the union of
different (non-deterministic) automata representing different sources of
information, without needing a potentially costly determinization. Conversely,
this is not true for $L_G$, since a non-deterministic automaton of this
language would have to be determinized before complementation.
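The construction of $A_-$ from a deterministic automaton of $L_G$ (completion, then swapping final and non-final states) can be illustrated by the following sketch. The tuple encoding of the automaton and the names complement_dfa, accepts and sink are assumptions made for the example only.

```python
def complement_dfa(alphabet, states, initial, finals, delta):
    """Complement a DFA whose transition function may be partial.

    delta maps (state, symbol) to a single successor state.
    """
    sink = ("sink",)  # fresh state, assumed not to clash with existing names
    all_states = set(states) | {sink}
    total = dict(delta)
    # Completion: every missing transition is redirected to the sink state.
    for q in all_states:
        for a in alphabet:
            total.setdefault((q, a), sink)
    # Complementation: swap final and non-final states.
    return alphabet, all_states, initial, all_states - set(finals), total


def accepts(dfa, word):
    """Run a complete DFA, as returned by complement_dfa, on a word."""
    _, _, state, finals, delta = dfa
    for a in word:
        state = delta[(state, a)]
    return state in finals
```

For instance, with $L_G = a^*b$ encoded by `{(0, "a"): 0, (0, "b"): 1}` and final state 1, the returned automaton rejects "b" and "aab" and accepts every other string over {a, b}. The completion step is what makes the swap correct: without the sink state, strings leaving the domain of delta would be rejected by both automata.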

If we denote by $A$ the current automaton representing the current language
hypothesis $L$, a direct way to handle this problem would be to build the
intersection product of $A$ and $A_-$ and to check whether the intersection
language is empty at each step of the inference. The algorithm presented
hereafter can be seen as a simulation of this intersection product that detects
non-emptiness incrementally by introducing and maintaining a common prefix
relation between the states of the two automata, defined as follows: Let
$A_1 = (\Sigma, Q_1, Q_{01}, F_1, \delta_1)$ and






Algorithm 2 Computation and incremental update of common prefix relation
(using the notations $A_1 = (\Sigma, Q_1, Q_{01}, F_1, \delta_1)$ and
$A_2 = (\Sigma, Q_2, Q_{02}, F_2, \delta_2)$)

Function common_prefix($A_1$, $A_2$)
    $\mathcal{E}_P \leftarrow \emptyset$ {common prefix relation storage}
    for all $q_1 \in Q_{01}, q_2 \in Q_{02}$ do
        add_to_common_prefix($A_1$, $A_2$, $q_1$, $q_2$, $\mathcal{E}_P$)
    return $\mathcal{E}_P$

Procedure add_to_common_prefix($A_1$, $A_2$, $q_1$, $q_2$, $\mathcal{E}_P$)
    if $(q_1, q_2) \notin \mathcal{E}_P$ then
        $\mathcal{E}_P \leftarrow \{(q_1, q_2)\} \cup \mathcal{E}_P$
        propagate_forward($A_1$, $A_2$, $q_1$, $q_2$, $\mathcal{E}_P$)

Procedure propagate_forward($A_1$, $A_2$, $q_1$, $q_2$, $\mathcal{E}_P$)
    for all $a \in \Sigma, q'_1 \in \delta_1(q_1, a), q'_2 \in \delta_2(q_2, a)$ do
        add_to_common_prefix($A_1$, $A_2$, $q'_1$, $q'_2$, $\mathcal{E}_P$)

Procedure update_after_state_merging($A_1$, $A_2$, $q$, $q'$, $\mathcal{E}_P$)
Require: $q$ is the resulting state of merging $q$ and $q'$ in $A_1$
    for all $q_2 : (q, q_2) \in \mathcal{E}_P$ do
        propagate_forward($A_1$, $A_2$, $q$, $q_2$, $\mathcal{E}_P$) {handling new transitions added to $q$}
    for all $q_2 : (q', q_2) \in \mathcal{E}_P$ do
        add_to_common_prefix($A_1$, $A_2$, $q$, $q_2$, $\mathcal{E}_P$) {reporting relations of $q'$}



$A_2 = (\Sigma, Q_2, Q_{02}, F_2, \delta_2)$. Two states
$(q_1, q_2) \in Q_1 \times Q_2$ share a common prefix iff
$P(q_1) \cap P(q_2) \neq \emptyset$. It is easy to see that the intersection of
$L(A_1)$ and $L(A_2)$ is not empty iff there exists a pair of final states
$(q_1, q_2) \in F_1 \times F_2$ in common prefix relationship. Thus, to apply
this detection to $A$ and $A_-$, it is sufficient to compute the set of states
of $A$ and $A_-$ in common prefix relationship and to test whether two final
states are in relation.
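The emptiness test described above can be sketched as follows. The encoding is an assumption made for illustration: each automaton is a dictionary with the keys "initial", "finals" and "delta", where delta[q][a] is the set of successors of q by symbol a. The recursion of the procedures is replaced here by an explicit worklist, which computes the same relation while avoiding deep recursion on large automata.

```python
def common_prefix(a1, a2):
    """Pairs of states (q1, q2) reachable by a common string from initial states."""
    relation = set()
    todo = [(q1, q2) for q1 in a1["initial"] for q2 in a2["initial"]]
    while todo:
        q1, q2 = todo.pop()
        if (q1, q2) in relation:
            continue
        relation.add((q1, q2))
        # Forward propagation over the symbols readable from both states.
        for symbol, dsts1 in a1["delta"].get(q1, {}).items():
            for d2 in a2["delta"].get(q2, {}).get(symbol, ()):
                todo.extend((d1, d2) for d1 in dsts1)
    return relation


def intersection_is_empty(a1, a2):
    """L(a1) and L(a2) are disjoint iff no pair of final states is related."""
    return not any(q1 in a1["finals"] and q2 in a2["finals"]
                   for q1, q2 in common_prefix(a1, a2))
```

Note that neither automaton needs to be deterministic, which matches the remark that $A_-$ may be given as a union of non-deterministic automata.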

Moreover, since common prefixes are preserved by state merging, this relation
may be maintained incrementally during the inference process. The algorithm
computing and incrementally updating the common prefix relation (algorithm 2)
works as follows. At the initialization step, the common prefix relation
between $A$ and $A_-$ is computed by the common_prefix() function. Then, after
each merge in $A$, the new common prefixes created by the state merging are
propagated using the update_after_state_merging() procedure. A violation of the
constraint is detected simply by checking whether a new common prefix relation
appears between a final state of the current automaton $A$ and a final state of
$A_-$. At the initialization step, such a violation means that the problem
cannot be solved with the given data (e.g. $L_-$ includes a sequence of $I_+$).
During the inference, it reveals that the last state merging is invalid and
should not be done (involving a backtrack).
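The incremental maintenance can be sketched as below, again under an assumed encoding (dictionaries with "initial", "finals" and a nested "delta" where delta[q][a] is a set of successors). The helper names successors, merge_states and violates are hypothetical. Pairs that mention the removed state q' are dropped from the relation after being reported on q, a bookkeeping detail that the pseudocode leaves implicit.

```python
def successors(a1, a2, q1, q2):
    """Pairs reachable from (q1, q2) by reading one common symbol."""
    for symbol, dsts1 in a1["delta"].get(q1, {}).items():
        for d2 in a2["delta"].get(q2, {}).get(symbol, ()):
            for d1 in dsts1:
                yield d1, d2


def add_to_common_prefix(a1, a2, pairs, relation):
    """Add the pairs to the relation and propagate forward from each new one."""
    todo = list(pairs)
    while todo:
        pair = todo.pop()
        if pair not in relation:
            relation.add(pair)
            todo.extend(successors(a1, a2, *pair))


def merge_states(a1, q, q_prime):
    """Merge q_prime into q, in place."""
    def rename(s):
        return q if s == q_prime else s
    delta = {}
    for src, by_symbol in a1["delta"].items():
        for symbol, dsts in by_symbol.items():
            delta.setdefault(rename(src), {}).setdefault(symbol, set()).update(
                rename(d) for d in dsts)
    a1["delta"] = delta
    a1["initial"] = {rename(s) for s in a1["initial"]}
    a1["finals"] = {rename(s) for s in a1["finals"]}


def update_after_state_merging(a1, a2, q, q_prime, relation):
    """Update the relation in place once q_prime has been merged into q in a1."""
    on_q = [pair for pair in relation if pair[0] == q]
    on_q_prime = [pair for pair in relation if pair[0] == q_prime]
    relation.difference_update(on_q_prime)  # q_prime no longer exists
    for _, q2 in on_q:  # new transitions inherited by q
        add_to_common_prefix(a1, a2, successors(a1, a2, q, q2), relation)
    # Relations formerly held by q_prime are reported on q.
    add_to_common_prefix(a1, a2, [(q, q2) for _, q2 in on_q_prime], relation)


def violates(a1, a2, relation):
    """True iff a final state of a1 is related to a final state of a2."""
    return any(q1 in a1["finals"] and q2 in a2["finals"] for q1, q2 in relation)
```

A typical use is to compute the relation once from the initial states, then call merge_states followed by update_after_state_merging for each candidate merge and undo the merge when violates returns True.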

The theoretical complexity of function common_prefix() is in
$O(|A| \times |A_-| \times t_A \times t_{A_-})$, where $t_A$ and $t_{A_-}$ are
the maximum number of outgoing transitions by the same symbol from states of
$A$ and $A_-$. In practice, a maximum sequence of $|A|$ successful state
mergings can be achieved during inference. During such a sequence of state
mergings, the complexity of the naive approach (i.e. recomputing relations
after each merge with function common_prefix()) is therefore
$O(|A|^2 \times |A_-| \times t_A \times t_{A_-})$. Using the incremental
update_after_state_merging() procedure improves this complexity by a factor of
$|A|$. Indeed, the worst-case complexity of this procedure is reached when
$|A| \times |A_-|$ relations are stored in $\mathcal{E}_P$ from the beginning.
Each of the next $|A|$ calls to update_after_state_merging() during the
inference process then has a complexity of
$O(|A_-| \times t_A \times t_{A_-})$: the procedure tries to propagate the
$|A_-|$ relations existing on the merged state, but all propagations stop since
the set of relations is already complete. This leads to a global complexity of
$O(|A| \times |A_-| \times t_A \times t_{A_-})$.

Fig. 1. Recognition level on the test set (i.e. the number of correctly
classified words of the test set divided by the test set size) of the solution
given by RPNI [14] (extended with the common prefix relation). The left plot
shows how the recognition level improves (grey level) when considering an
increasing “quantity” of domain bias (abscissa) and an increasing training
sample size (ordinate). The right plot represents vertical sections of the left
plot. The experimental setting is the following: a 31-state target automaton, a
training set and a testing set have been generated by the Gowachin server
(http://www.irisa.fr/Gowachin/). The complement $A_-$ of the target automaton
has been computed to provide domain information progressively by considering an
increasing number of its 15 final states. Each point in a plot is a mean of
nine executions with different training sets and different choices for the
final states of $A_-$. Even if only illustrative, one can remark that the size
of the training sample needed for RPNI to converge is strongly reduced when
introducing domain bias (even for “small” bias), but that using only this
information without providing a sample of a reasonable size does not enable the
algorithm to converge.

We have presented here the introduction of domain bias in a state merging
inference algorithm. The proposed approach is generic enough to be adapted
to other inference algorithms whenever they proceed by automaton
generalization. In particular, since DeLeTe2 generalization operations [8]
preserve common prefix relations, this approach could easily be applied to this
algorithm. An experiment on artificial data showing the interest of using
domain bias during automata inference is provided in figure 1.

In the next section, we propose to use the common prefix relation to take
prior typing knowledge into account in the inference process.

3 Using Typing Information

In applications, information on the symbols of the examples is often available. 
For instance, in genomics, the secondary structure to which the amino acid 
belongs in the protein is often known. In linguistics, one can have the result of 
the part of speech tagging or chunking of the sequences. This information may 
be considered as a typing of the symbols in the sequences and may have been 
given either by an unknown machine or an expert (in this case, only the result 
of the typing of example sequences is available) or by a typing machine provided 
by the expert as background knowledge (in this case, the typing is known for 
the examples but also for all the sequences in the domain of the machine). In 
sections 3.1 and 3.2, we will focus on the second case and will assume that we are 
given a typing machine by the expert. The first case will be studied specifically 
in section 3.3. 



3.1 Inference of Compatible Type Automata 

Formalisation of the Expert Knowledge: If we designate by $\Sigma$ the symbol
alphabet and by $S$ the sort alphabet, a symbol typing function $\tau$ is
defined from $\Sigma^* \times \Sigma \times \Sigma^*$ to $S$. We then designate
by $\tau(u, a, v)$ the sort associated to symbol $a$ in string $uav$. From this
definition, the sort can depend on $u$, called the left context, on $v$, called
the right context, and on the typed symbol $a$ itself. The expert knowledge on
sorts can be represented as a typing function $\tau$ provided to the inference
process. Therefore, the sorts returned by the function $\tau$ have a semantic
meaning on the basis of the expert knowledge.

Since we are considering the inference of FSA, we will assume here that the
typing function provided is somehow “rational”, i.e. that the typing could have
been generated by a finite state machine. Thus, following [11], we consider
that the typing function $\tau$ provided by the expert can be encoded in a type
automaton: a finite state type automaton $T$ is a tuple
$(\Sigma, Q, Q_0, F, \delta, S, \sigma)$ where $\Sigma, Q, Q_0, F, \delta$ are
the same as in classical FSA, $S$ is a finite set of sorts and
$\sigma : Q \to S$ is the state typing function. A type automaton $T$
represents a typing mapping $\tau$ from
$\Sigma^* \times \Sigma \times \Sigma^*$ to $2^S$ defined by
$\tau(u, a, v) = \{\sigma(q) \mid \exists q_0 \in Q_0, \exists q_f \in F, q \in \delta(q_0, ua) \text{ and } q_f \in \delta(q, v)\}$.
In order to consider typing functions (i.e. from
$\Sigma^* \times \Sigma \times \Sigma^*$ to $S$) instead of typing mappings, we
restrict ourselves to type automata such that
$\forall u, v \in \Sigma^*, \forall a \in \Sigma : |\tau(u, a, v)| \leq 1$.

As improvements on [11], we allow type automata to be non-deterministic, in
order to handle typing functions depending not only on the left context but
also on the right context of the word. We also allow incompletely specified
typing functions, as is often the case in practice. Figure 2 gives an example
of a type automaton.
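The set $\tau(u, a, v)$ represented by a type automaton can be computed by a direct simulation, as in the sketch below. The encoding is assumed for illustration: "delta" is a nested dictionary where delta[q][a] is a set of successors, and "sigma" is a dictionary from states to sorts that may be partial, which is one possible reading of an incompletely specified typing function.

```python
def run(delta, start_states, word):
    """States reachable from start_states by reading word (NFA simulation)."""
    current = set(start_states)
    for symbol in word:
        current = {d for q in current for d in delta.get(q, {}).get(symbol, ())}
    return current


def typing(automaton, u, a, v):
    """Set of sorts tau(u, a, v); functional typings return at most one sort."""
    after_ua = run(automaton["delta"], automaton["initial"], u + a)
    return {
        automaton["sigma"][q]
        for q in after_ua
        # q must carry a sort and must lead to a final state by reading v
        if q in automaton["sigma"]
        and run(automaton["delta"], {q}, v) & automaton["finals"]
    }
```

With a non-deterministic automaton that reads a from state 0 into either state 1 (sort X, followed by b) or state 2 (sort Y, followed by c), the sort of a depends on the right context only: typing returns {"X"} for the context ("", "a", "b") and {"Y"} for ("", "a", "c").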

Integration of the Expert Knowledge in Inference: To integrate the typing
information in the learned automaton, the automaton has to embed this
information in its structure. To achieve this goal, a sort is associated to
each state of the learned automaton, which can therefore be considered as a
type automaton. The idea is that the typing function embedded in the structure
of the inferred automaton has to be compatible with the typing function
provided as expert knowledge.

Fig. 2. Expert knowledge on protein families is available in the Prosite
database (http://au.expasy.org/prosite/). This type automaton has been
constructed from the Prosite pattern PS00142 (Zinc_Protease). $\Sigma$ is the
amino acid alphabet, and the state typing function may be defined to return the
state number on significant states. Knowledge embedded in this pattern is for
instance the presence of the two amino acids H (Histidine) in the pattern
because of their good properties to fix Zinc molecules.



We denote by $\mathcal{D}_\tau$ the domain of a typing function $\tau$, which
is the set of tuples $(u, a, v)$ such that $\tau(u, a, v)$ is defined. Thanks
to our distinction between the domain and the typing knowledge, we use a weaker
definition of the compatibility of two typing functions than the one proposed
in [11]: two typing functions $\tau$ and $\tau'$ are compatible iff
$\forall (u, a, v) \in \mathcal{D}_\tau \cap \mathcal{D}_{\tau'}$,
$\tau(u, a, v) = \tau'(u, a, v)$.
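For typing functions with a finite, explicitly listed domain, this definition can be checked directly. The sketch below assumes, for illustration only, that a typing function is stored as a dictionary whose keys (u, a, v) form its domain; typing functions with infinite domains, as represented by type automata, need the relation-based scheme described in the text instead.

```python
def compatible(tau1, tau2):
    """True iff the two partial typing functions agree on their common domain."""
    shared_domain = tau1.keys() & tau2.keys()
    return all(tau1[context] == tau2[context] for context in shared_domain)
```

Two functions with disjoint domains are always compatible under this definition, which is what makes it weaker than requiring equal domains.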

In the following, two type automata will be considered compatible if their
typing functions are compatible. We propose to ensure the compatibility of the
learned automaton $A$ with the provided type automaton $A_T$ during the
inference process by a scheme similar to the one presented in section 2.
However, since left and right contexts are considered for typing, the common
prefix relation is not sufficient. To propagate the acceptance information
backward to the states, the common suffix relation is also used. It is defined
as follows: Let $A_1 = (\Sigma, Q_1, Q_{01}, F_1, \delta_1)$ and
$A_2 = (\Sigma, Q_2, Q_{02}, F_2, \delta_2)$. Two states
$(q_1, q_2) \in Q_1 \times Q_2$ are said to share a common suffix iff
$S(q_1) \cap S(q_2) \neq \emptyset$. This definition is completely symmetric to
the definition of the common prefix relation and shares the same properties.
Thus, computation and incremental maintenance of this relation can be done by
functions similar to those given in algorithm 2, simply by replacing the set of
initial states by the set of final states and by propagating the relations
backward instead of forward, using the inverse of the transition function.
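One way to obtain the common suffix relation, sketched below, is to reverse both automata and reuse the forward computation. The encoding (dictionaries with "initial", "finals" and a nested "delta" of successor sets) and the names reverse, reachable_pairs and common_suffix are assumptions made for the example.

```python
def reverse(aut):
    """Swap initial and final states and invert every transition."""
    rev = {}
    for src, by_symbol in aut["delta"].items():
        for symbol, dsts in by_symbol.items():
            for dst in dsts:
                rev.setdefault(dst, {}).setdefault(symbol, set()).add(src)
    return {"initial": set(aut["finals"]), "finals": set(aut["initial"]),
            "delta": rev}


def reachable_pairs(a1, a2):
    """Pairs of states reachable from the initial states by a common string."""
    relation = set()
    todo = [(q1, q2) for q1 in a1["initial"] for q2 in a2["initial"]]
    while todo:
        q1, q2 = todo.pop()
        if (q1, q2) in relation:
            continue
        relation.add((q1, q2))
        for symbol, dsts1 in a1["delta"].get(q1, {}).items():
            for d2 in a2["delta"].get(q2, {}).get(symbol, ()):
                todo.extend((d1, d2) for d1 in dsts1)
    return relation


def common_suffix(a1, a2):
    """Pairs of states from which a common string leads to final states."""
    return reachable_pairs(reverse(a1), reverse(a2))
```

Reversing once and propagating forward is equivalent to propagating backward with the inverse transition function; an incremental version would keep the reversed transition tables up to date after each merge instead of rebuilding them.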

The common prefix and common suffix relations are used during the inference
process to ensure the compatibility between the inferred automaton $A$ and the
automaton $A_T$ embedding the expert typing function. This is done by assigning
to each state $q$ of $A$ sharing both a common prefix and a common suffix with
a state $q_T$ of $A_T$ the sort of $q_T$, and by prohibiting the assignment of
different sorts to the same state. At the initialization step, this projection
of the sorts initializes the state typing function of $A$. A failure at this
point (i.e. trying to assign two different sorts to one state) shows that the
typing function of $A_T$ is not functional or, in the deterministic case when
the prefix tree acceptor is used as MCA [9], that the typing cannot be realized
by a DFA. During the inference, a failure reveals that the last state merging
is invalid and should not be done (involving a backtrack). As for domain
background knowledge, an experiment on artificial data showing the interest of
typing bias during automata inference is provided (figure 3).

It may be easily shown that states of $A$ with different sorts will always fail
to be merged and that trying to merge them is useless. In figure 4, we show
that the initial typing of the states is not sufficient, even for deterministic
type automata, to ensure compatibility and that propagation of relations has to
be done after state merging to ensure that new accepted sequences are correctly
typed. We will study in the next sections two special cases in which avoiding
the merging of states with different sorts is sufficient to ensure
compatibility.

Fig. 3. Recognition level when given a type automaton $A_T$. The same
experimental setting and the same automaton as in figure 1 are used, except
that instead of giving final states information, typing information is given
progressively (here, between 1 and 31 typed states of $A_T$ chosen randomly;
all the states are final; the left plot is truncated at 15 typed states).
Convergence is faster since the given typing information is richer than the
domain one.

Fig. 4. $A_T$, the deterministic MCA for $I_+ = \{aa, ab\}$, and $A$ obtained
by merging the source and target states of the transition by $b$. Although the
merged states have the same sort A, in $A_T$ $\tau(ab, a, \lambda) = A$ whereas
in the merged automaton $\tau(ab, a, \lambda) = B$ (and the problem remains
even if we consider a complete sample with respect to $A_T$, e.g.
$\{aa, ab, ba, bb\}$).

3.2 Specialization of a Regular Grammar 

In the setting of [11], typing is feasible only if: 1. the expert type
automaton is deterministic, 2. it has a completely defined state typing
function, 3. it accepts at least all words of $I_+$, 4. it has different sorts
for all its states (i.e.
$\forall q, q' \in Q_T, q \neq q' \Rightarrow \sigma(q) \neq \sigma(q')$), and
5. the inferred automaton is deterministic. The setting presented in section
3.1 removes all these constraints at the cost of a higher complexity. Under
these restrictions, [11] shows that the initial typing of the MCA can be
realised in $O(|MCA|)$, and that the maintenance of types can be realised in
$O(1)$ after each state merging.

In fact, the determinism restrictions 1. and 5. can be removed while keeping
the $O(1)$ type management after each state merging. This is done by realizing
the initial typing as proposed in subsection 3.1. The maintenance of types
along inference is realized in $O(1)$ by preventing the merging of states with
different sorts. The formal proof follows the same lines as the one provided
in [11]. This extension is interesting because it makes it possible to handle
efficiently some typing functions that cannot be represented by deterministic
type automata with different sorts for all states (typing functions $\tau$ such
that $\tau(u, a, bv) = \tau(u', a', bv')$ and
$\tau(ua, b, v) \neq \tau(u'a', b, v')$) but that can be represented by
non-deterministic ones.

In [11], the semantics of the restrictions remains unclear. We discuss here an
interpretation and the links between this typing framework and domain
background knowledge. Under constraints 2. to 4., all compatible automata of
the search space in the state-merging framework can be seen as specializing the
structure of $A_T$. Indeed, constraint 4. implies a one-to-one correspondence
between the sorts and the states of $A_T$, and thus the information given by
typing is the structure of $A_T$. Let $A'_T$ be the sub-automaton of $A_T$
obtained by discarding the states that are useless for the acceptance of $I_+$.
We can show that $A'_T$ can be obtained by merging states of any compatible
automaton of the search space. Therefore these automata are specializations of
$A_T$.

By inferring automata $A$ specializing the structure of $A_T$, we also
specialize the recognized language, i.e. $L(A) \subseteq L(A_T)$. Thus, by
constraining the structure, the domain is also constrained. We obtain a setting
in which both the structure and the language of the automaton $A_T$ are
specialized, and we should rather consider that this setting corresponds to the
specialization of an automaton (not necessarily typed). In particular, it
should be noticed that in this setting some automata satisfying the language
constraint, but not the structural one, can be excluded from the search space.



3.3 Inference Given Typed Examples 

A practical case in which updating the common prefix and common suffix
relations is not necessary is when we are given only typed examples. A type
automaton $A_T$ could easily be constructed from the typed examples and would
verify $L(A_T) = I_+$. In that case, no typing constraint is given about
sequences outside the training set, and the sole remaining constraint is to
avoid creating two acceptance paths leading to different typings for a sequence
of $I_+$ (to ensure compatibility). This can be avoided by considering the
inference of DFA or, more generally, of unambiguous type automata [6] (which
also ensures functionality of the learned type automaton). Then, one can show
that the initial projection of the sorts onto the MCA, together with forbidding
the merging of states with different sorts, is sufficient to ensure
compatibility. It should be noticed, however, that some complementary state
merging (for determinization or for disambiguation) may have to be done before
detecting the failure. This case should rather be considered as one in which
the typing information can be directly encoded in the automaton $A$ and does
not need a second automaton $A_T$ besides it.

Inference given typed examples is interesting when no type automaton is
given and only a tagging of the examples is available. This setting is also
interesting if a typing function that cannot be converted into a type automaton
is available. In this case, the function can be used at least to type the
examples. We illustrate this by an experiment on the ATIS task using
Part-Of-Speech tagger typing information (figure 5).

Fig. 5. We have tested the use of type information for automaton inference on
the Air Travel Information System (ATIS) task [12]. This corpus has been widely
used in the speech recognition community to test various language models. The
type information was composed of Part-Of-Speech (POS) tags provided by the
Brill tagger [2]. Here we inferred stochastic automata using the Alergia
inference algorithm [4] with and without the POS-tag information. Each word was
tagged with its most probable POS-tag, disregarding the context rules in the
tagger. In this task, a usual quality measure is the perplexity of the test set
$S$ (ordinate), given by
$P = 2^{-\frac{1}{|S|} \sum_{w \in S} \log_2 p(w)}$, where $p(w)$ is the
probability given by the stochastic automaton to the word $w$. The smaller the
perplexity, the better the automaton can predict the next symbol. The sentences
with 0 probability are ignored in this score (the presence of one of these
sentences would lead to an infinite perplexity). So, to evaluate the results,
we also have to report the percentage of sequences accepted, i.e. with non-zero
probability (abscissa). In the Alergia algorithm, generalisation is controlled
by one parameter. Different values for this parameter provided the different
points of the figure. The best results are situated in the bottom right corner
as they correspond to high coverage and small perplexity. For a given number of
sentences parsed, the use of POS-tag based types reduces the partial perplexity
and provides better models.
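The two quantities plotted in figure 5 can be computed as in the following sketch. It assumes that the probabilities assigned by the stochastic automaton are already available as a list of numbers; the function name partial_perplexity and the choice of normalizing by the number of items with non-zero probability are assumptions of this example, not details confirmed by the caption.

```python
import math


def partial_perplexity(probabilities):
    """Return (perplexity over non-zero items, fraction of non-zero items)."""
    kept = [p for p in probabilities if p > 0]
    coverage = len(kept) / len(probabilities)
    if not kept:
        return math.inf, coverage
    mean_log = sum(math.log2(p) for p in kept) / len(kept)
    return 2 ** (-mean_log), coverage
```

For example, `partial_perplexity([0.5, 0.25, 0.0, 0.125])` gives a perplexity of 4.0 with a coverage of 0.75: the zero-probability item is excluded from the score but lowers the coverage, which is why both values have to be reported together.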




4 Discussion, Conclusion and Perspectives

We have proposed a generic setting to use domain and/or typing knowledge for
the inference of automata. This setting includes the non-deterministic and
incomplete knowledge cases and allows different degrees of efficiency. In
particular, two practical cases have been identified in which the expert
knowledge can be taken into account with a small overhead. Experiments on real
applications are now needed to validate the approach and to quantify
(experimentally, but also theoretically) the amount of help provided.

As pointed out by one of the reviewers, the presented models have to be
compared to existing models for learning in helpful environments. A starting
point of that research for the domain bias could be a comparison with inference
using membership queries [1]. Indeed, a language of counter-examples provided
as an automaton can answer some of the membership queries (the ones concerning
words in the language of $A_-$). For the typing bias, a comparison with the
work of [3] has to be carried out. We also have to explore the fact that the
behaviour of our algorithms is unclear when the provided knowledge is
erroneous. In this case, a solution could be to use this knowledge as a
heuristic instead of a pruning constraint.

Another promising perspective is to study the extension of this work to more
powerful grammars. In particular, coupling non-terminals with typing could
provide an interesting framework to include semantic background knowledge in
the inference of context-free grammars and should be compared to the parsing
skeleton information used with success by Sakakibara [16].

Abstract. We present a definition of analogy on sequences which is
based on two principles: the definition of an analogy between the letters
of an alphabet and the use of the edit distance between sequences. Our
definition generalizes the algorithm given by Lepage and is compatible
with another definition of analogy in sequences given by Yvon.



1 Introduction: Learning Sequences by Analogy 

We study in this paper a lazy method of supervised learning in the universe of 
sequences. We assume that there exists a learning set of sequences, composed 
of sequences associated with class labels. When a new sequence is introduced, a 
supervised learning algorithm has to infer which label to associate with this new 
sequence. 

Lazy learning makes no parametric assumption on the data and uses only the
learning set. The simplest lazy learning technique is the nearest neighbor
algorithm: the label attributed to the new sequence is that of the nearest
sequence in the learning set. It requires a definition of a distance (or at
least a dissimilarity) between sequences.

Analogy is a more complex lazy learning technique, since it is necessary to
find three sequences in the learning set and to use a more sophisticated
argument. Let X be the sequence to which we want to give a label. We have to
find three sequences A, B and C in the learning set, with labels L(A), L(B) and
L(C), such that “A is to B as C is to X”. Then the label of X will be computed
as “L(X) is to L(C) as L(B) is to L(A)”. This is for example the way that we
can guess the past tense of the verb “to grind” knowing that the past tense of
the verb “to bind” is “bound”.

Learning by analogy requires defining the terms “is to” and “as”. This is the
primary goal of this paper, which is organized as follows. In section 2, we
define more precisely what analogy is, especially on sequences. We present
our approach, based on the edit distance and on the resolution of analogical
equations on letters. In section 3, we recall what the edit distance is and we
give a formal framework, firstly to solve analogical equations on letters, and
secondly to compute all the solutions of analogical equations on sequences.
Finally, in section 4, we compare our proposal with related work and show that
it generalizes previous algorithms. We then conclude and give some possible
extensions to our work.

* The research reported here was supported by CNRS interdisciplinary program
TCAN Analangue and some of its ideas have been elaborated during its meetings.
We especially thank Nicolas Stroppa and François Yvon at ENST Paris for their
comments and help in formalizing Section 3.3.



2 What Is Analogy on Sequences 

2.1 Analogy 

Analogy is a way of reasoning which has been studied throughout the history of 
philosophy and has been widely used in Artificial Intelligence and Linguistics. 
Lepage ([1], in French, or [2], in English) has given an extensive description of 
the history of this concept and its application in science and linguistics. 

An analogy between four objects or concepts A, B, C and D is usually
expressed as follows: "A is to B as C is to D". Depending on what the objects
are, analogies can have very different meanings. For example, natural language
analogies could be: "a crow is to a raven as a merlin is to a peregrine" or
"vinegar is to bordeaux as a sloe is to a cherry". These analogies are based
on the semantics of the words. By contrast, in the formal universe of sequences,
analogies such as "abcd is to abc as abbd is to abb" or "g is to gg as gg is to ggg"
are syntactic.

Whether syntactic or not, the examples above show the intrinsic ambiguity in
defining an analogy. Some would have good reasons to prefer:
"g is to gg as gg is to gggg". Obviously, such ambiguities are inherent in
semantic analogies, since they are related to the meaning of words (the
concepts are expressed through natural language). Hence, it seems easier to
focus on formal syntactic properties. Moreover, resolving syntactic analogies
is an operational problem in several fields of linguistics, such as morphology
and syntax, and provides a basis for learning and data mining by analogy in the
universe of sequences.

The first goal of this paper is therefore to give a definition of a syntactic 
analogy between sequences of letters. 



2.2 Analogies and Analogical Equations on Sequences 

In this section, we focus on concepts that are sequences of letters in a finite
alphabet and we are interested in studying what an analogy on these concepts is.

Our development will be based on two basic ideas. Firstly, we will formalize
the comparison of sequences through the classical notion of edit distance
between sequences; we will give a method in section 3 which will be proved to
transform the problem of analogy between sequences into that of analogy between
letters. Secondly, we will introduce an algebraic definition of the analogy
between letters, or more generally between the elements of a finite set
(section 3.2).

Unlike D. Hofstadter et al. [3], we will not a priori consider as correct the 
following analogy: "abc is to abd as ghi is to ghj", since we assume that the 
alphabet of the sequences is simply a finite set of letters, with no order relation. In 
that, we will stick to the classical definition of an alphabet in language theory. 
Of course, adding properties on an alphabet increases the possible number of 
ambiguities in resolving analogies. If we want to give an algorithmic definition, we 
have to focus our interest on problems with the lowest number of free parameters. 

We denote now " A is to B as C is to D" by the equation A : B = C : D and 
we say informally that solving an analogical equation is finding one or several 
values for X from the relation A : B = C : X . We will give a more precise 
definition at the end of this section. 

The classical definition of A : B = C : D as an analogical equation requires
the satisfaction of two axioms, expressed as equivalences of this primitive
equation with two other equations [2]:

    Symmetry of the ‘as’ relation:  C : D = A : B
    Exchange of the means:          A : C = B : D

As a consequence of these two basic axioms, five other equations are easy to
prove equivalent to A : B = C : D:

    Inversion of ratios:            B : A = D : C
    Exchange of the extremes:       D : B = C : A
    Symmetry of reading:            D : C = B : A
                                    B : D = A : C
                                    C : A = D : B

Another possible axiom (determinism) requires that one of the following trivial
equations has a unique solution (the other being a consequence):

    A : A = B : X  =>  X = B
    A : B = A : X  =>  X = B

We can now give a definition of a solution to an analogical equation which
takes into account the axioms of analogy.

Definition 1 X is a correct solution to the analogical equation A : B = C : X
if X is a solution to this equation and is also a solution to the two other
equations:

    C : X = A : B   and   A : C = B : X
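The claim that the two axioms generate exactly eight equivalent forms of an analogical equation can be checked mechanically. The sketch below is only an illustration: it represents the equation A : B = C : D as the tuple (A, B, C, D) and closes the set of forms under the two axioms; the function name equivalent_forms is hypothetical.

```python
def equivalent_forms(a, b, c, d):
    """All forms of a : b = c : d obtained by repeatedly applying the two axioms."""
    def symmetry(t):   # A : B = C : D  becomes  C : D = A : B
        return (t[2], t[3], t[0], t[1])

    def exchange(t):   # A : B = C : D  becomes  A : C = B : D
        return (t[0], t[2], t[1], t[3])

    seen, todo = set(), [(a, b, c, d)]
    while todo:
        form = todo.pop()
        if form not in seen:
            seen.add(form)
            todo += [symmetry(form), exchange(form)]
    return seen
```

Calling `equivalent_forms("A", "B", "C", "D")` returns eight tuples: the primitive equation, the two axioms, and the five derived equations listed above.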

3 A Formal Framework for Solving Analogies 

As our approach is based on the edit distance, we first recall what it is. In
the second part of this section, we give sufficient (algebraic) arguments to
justify the resolution of all analogical equations on letters. Then we present
a formal way to characterize the solutions of equations on sequences by using
transducers and finite state automata. We have derived an algorithm [4] from
these formal approaches.

3.1 The Edit Distance Between Two Sequences

We give here notations and definitions of elementary language theory [5] and we
recall what the edit distance between sequences is.

Basic Definitions About Sequences. Let $\Sigma$ be a finite set that we will
call an alphabet. We call letters $a, b, c, \ldots$ the elements of $\Sigma$.
We denote by $u, v, \ldots$ or $A, B, \ldots$ the elements of $\Sigma^*$,
called sequences or sentences or words. A sequence
$u = u_1 u_2 \ldots u_{|u|}$ is an ordered list of letters of $\Sigma$; its
length is denoted $|u|$. $\epsilon$, the empty sequence, is the sequence of
null length. We use the classical notion of concatenation: if
$u = u_1 u_2 \ldots u_{|u|}$ and $v = v_1 v_2 \ldots v_{|v|}$, their
concatenation is $uv = u_1 u_2 \ldots u_{|u|} v_1 v_2 \ldots v_{|v|}$.

Alignments as a Particular Case of Transformation. To compute the edit
distance, we have to give more definitions and quote a theorem, proved by
Wagner and Fischer in [6]. We first have to introduce the notion of editing
between sequences. Editing is based on three edit operations between letters:
the insertion of a letter in the target sequence, the deletion of a letter in
the source sequence and the substitution, replacing a letter in the source
sequence by another letter in the target sequence. Each of these operations can
be associated to a cost $C$. We denote by $C_{a \to b}$ the cost of
substitution of $a$ into $b$, by $C_{a \to \epsilon}$ the cost of deletion of
$a$ and by $C_{\epsilon \to b}$ the cost of insertion of $b$. The cost of the
editing between sequences is the sum of the costs of the operations between
letters required to transform the source sequence into the target one.

Definition 2 The edit distance is the minimum cost of all possible transforma- 
tions between two sequences. 

Definition 3 An alignment between two words x, y ∈ Σ*, of respective lengths
m and n, is a word z on the alphabet (Σ ∪ {ε}) × (Σ ∪ {ε}) \ {(ε, ε)} whose
projection on the first component is x and whose projection on the second component
is y.

Informally, an alignment represents a sequence of edit operations; the pair
(ε, ε) is excluded because it is not an edit operation. An alignment can be presented
as an array of two rows, one for x and one for y, each word being padded with some ε
so that both rows have the same length.

For instance, here is an alignment between x = abgef and y = acde:

x^1 = a b ε g e f
      | | | | | |
y^1 = a c d ε e ε

In the following, we will denote by x^1 = abεgef and y^1 = acdεeε the two sentences
created by the optimal alignments between x and y. The sentences x and
x^1, on one hand, and y and y^1, on the other hand, have the same semantics in
language theory, since ε is the empty word.

The following theorem [6] states that the only transformations to be considered
for computing the edit distance are the alignments. In this theorem, C
denotes the cost of transforming one letter into another (C is either C_{a→b},
C_{ε→b} or C_{a→ε}).

Theorem 1 If C is a distance on Σ, then the edit distance D can be computed
as the cost of an alignment with the lowest cost that transforms x into y.

An alignment corresponding to the edit distance, that of lowest cost, is often
called optimal. It is not necessarily unique.

It is now possible to use the classical dynamic programming Wagner and Fis- 
cher algorithm [6] which computes the edit distance and the optimal alignment. 
A consequence of this algorithm is the following remarkable result [7], which 
justifies the name of edit distance: 

Theorem 2 If C is a distance on (Σ ∪ {ε}) then D is also a distance on Σ*.

This algorithm can be completed to construct x^1 and y^1 from x and y and to
produce the optimal alignment, or all the optimal alignments if there is more
than one. This is done by memorizing more information during the computation
and by backtracking on an optimal path in the final matrix [8] computed by the
Wagner and Fischer algorithm.
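
As an illustration of the dynamic programming scheme and of the backtracking step, here is a minimal Python sketch. It is not the implementation of [6] or [8]: the function names, the unit cost table and the use of the printable character "ε" for the empty word are assumptions made for the example. It returns the edit distance together with one optimal pair (x^1, y^1).

```python
EPS = "ε"  # printable stand-in for the empty word (assumption of this sketch)

def unit_cost(a, b):
    """Hypothetical cost table: 0 for a copy, 1 for any other operation."""
    return 0 if a == b else 1

def edit_alignment(x, y, cost=unit_cost):
    """Return (D, x1, y1): the edit distance and one optimal alignment."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between the prefixes x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + cost(x[i - 1], EPS)        # deletions only
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + cost(EPS, y[j - 1])        # insertions only
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j - 1] + cost(x[i - 1], y[j - 1]),  # substitution or copy
                D[i - 1][j] + cost(x[i - 1], EPS),           # deletion
                D[i][j - 1] + cost(EPS, y[j - 1]),           # insertion
            )
    # Backtrack along one optimal path of the matrix to rebuild x1 and y1.
    x1, y1, i, j = [], [], m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + cost(x[i - 1], y[j - 1]):
            x1.append(x[i - 1]); y1.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + cost(x[i - 1], EPS):
            x1.append(x[i - 1]); y1.append(EPS); i -= 1
        else:
            x1.append(EPS); y1.append(y[j - 1]); j -= 1
    return D[m][n], "".join(reversed(x1)), "".join(reversed(y1))
```

Removing the ε symbols from the two returned rows gives back x and y. Collecting every branch whose cost matches during backtracking, instead of the first one, would enumerate all optimal alignments.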



3.2 Analogical Equations on Alphabets and Finite Groups 

In this section, we give a method for solving analogical equations on sequences
based on the edit distance. This approach is composed of two steps. The first
one is to give a correct solution to an analogical equation on a set composed of
a finite alphabet Σ plus the empty string ε. The aim of this section is to give an
algebraic definition of analogy between letters and to find a link between
this algebraic structure and the distance δ on Σ ∪ {ε} (cf. section 3.1).



An Algebraic Definition of Analogy. In a vector space, the analogy OA :
OB = OC : OX is quite naturally solved by choosing X as the fourth vertex
of the parallelogram built on A, B and C (Fig. 1). Obviously, this construction
satisfies all the axioms of a correct solution to the analogical equation.

A parallelogram has two equivalent definitions: either
AB = CX (or AC = BX), or OA + OX = OB + OC.

Let us first consider the second definition (we will come back to the first
one later). To transfer it to an alphabet Σ, we have to define in the
same manner an operator ⊕ from (Σ ∪ {ε}) × (Σ ∪ {ε}) to a set F such that:

a ⊕ x = b ⊕ c ⇔ a : b = c : x

It is not important what F is for the moment. We want to define what such an
operator ⊕ can be and what structure it would give to the set Σ ∪ {ε}.

Fig. 1. Analogy in a vector space.



Properties of Analogy According to the Operator ⊕. We have given in
section 2 the axioms of analogy as described by Lepage. Let us rewrite them with
the operator ⊕. For each axiom, we exhibit a corresponding algebraic property
and this will allow us to determine more precisely what properties the operator
⊕ must have.

Symmetry: a : b = c : d ⇒ c : d = a : b, that is, a ⊕ d = b ⊕ c ⇒ c ⊕ b = d ⊕ a.

Exchange of the means: a : b = c : d ⇒ a : c = b : d, that is, a ⊕ d = b ⊕ c ⇒
a ⊕ d = c ⊕ b.

From this, we can deduce that our operator ⊕ must be commutative, since
c ⊕ b = b ⊕ c if a : b = c : d. Moreover, we notice that the equation concerning
the symmetry of analogy is always true because of the commutativity of the
operators ⊕ and =.

Determinism: a : a = c : x ⇒ x = c and a : b = a : x ⇒ x = b. It can be
expressed with ⊕ as: a ⊕ x = a ⊕ c ⇒ x = c and a ⊕ x = b ⊕ a ⇒ x = b.

The first equation expresses the algebraic property of left regularity. Because
of the commutativity, we can say that ⊕ must be regular.

Construction of an Operator and a Distance

The alphabet as a finite group. We already know that the operator ⊕ must be
commutative and regular. In addition to these properties, we would like to solve
some cases in analogy that Lepage cannot handle.

One of these cases is to find a solution to the analogical equation a : ε = ε : x,
which can be expressed as a ⊕ x = ε ⊕ ε. If we consider that ⊕ is an internal
composition operator and that ε is the neutral element of Σ ∪ {ε} for ⊕, then we
transform the above expression into: a ⊕ x = ε ⊕ ε = ε, that is, x ⊕ a = ε.

If every element in Σ ∪ {ε} has a symmetric (an inverse), every equation of this form
has a solution, which is the symmetric of a. Assuming that (Σ ∪ {ε}, ⊕) is a group
[9] is sufficient to get these properties. Moreover, this group is abelian since ⊕ is
commutative.

An example: the additive cyclic group. We take as an example the cyclic finite
abelian group. The table that describes ⊕ in this case is given in [4] and in
Fig. 2. This algebraic structure is sufficient to solve every analogical equation
on letters, but it is only one solution among others. This table, where each
row is a circular permutation of the others, gives the unique solution to every
analogical equation on letters. The case quoted earlier in section 3.2,
a : ε = ε : x, also has a solution.

Fig. 2. A table for an analogical operator on an alphabet of 6 elements plus ε, seen as
the additive cyclic group C_7, and the corresponding discrete distance C.

A distance on the additive cyclic group. We also have to build a distance
on the alphabet, since it is necessary to compute the edit distance between
strings of letters when using the Wagner and Fischer algorithm [6]. In the
case of the additive cyclic group, we have a well-defined table ([4] and Fig.
2). For a quadruplet (x, y, z, t) that defines an analogy, we have the equation:
((x : y = z : t) ⇔ (x ⊕ t = y ⊕ z)) ⇒ C_{x→y} = C_{z→t}. This equation is coherent
with the first characterization of a parallelogram, as given in section 3.2. Our
aim is to guess a particular structure for the corresponding distance table, if
there is one, by using the analogy table. Considering all the equations deduced
from analogical equations given by this particular analogy table, we can deduce
the distance related to the analogy defined by the table. The distance table has only
UJ different values and has a constrained structure (see [4]). 

We have a way to construct a distance table under these constraints: by using
a geometrical representation in ℝ², in which the letters are regularly placed
on a circle (see Fig. 3) and by defining the distance between letters as the
Euclidean distance in ℝ². This distance table is coherent with the distance related
to analogy that we have just defined. But, of course, other solutions can be
devised¹.
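
To make the construction concrete, the following Python sketch encodes an alphabet of 6 letters plus ε as the additive cyclic group with 7 elements and places the letters regularly on the unit circle. The assignment of letters to group elements (alphabetical order, ε as 0) and all function names are assumptions of this sketch; the table of Fig. 2 may use a different assignment.

```python
import math

# Assumed encoding: ε is the neutral element 0, then a..f are 1..6.
LETTERS = ["ε", "a", "b", "c", "d", "e", "f"]
N = len(LETTERS)
INDEX = {s: k for k, s in enumerate(LETTERS)}

def oplus(p, q):
    """The operator ⊕, taken here as addition modulo N on the indices."""
    return LETTERS[(INDEX[p] + INDEX[q]) % N]

def solve_letter(a, b, c):
    """Solve a : b = c : x, i.e. find x such that a ⊕ x = b ⊕ c."""
    return LETTERS[(INDEX[b] + INDEX[c] - INDEX[a]) % N]

def circle_distance(p, q):
    """Euclidean distance between two letters placed regularly on the unit circle."""
    angle = 2 * math.pi * (INDEX[p] - INDEX[q]) / N
    return abs(2 * math.sin(angle / 2))  # chord length for the central angle
```

With this encoding, solve_letter("a", "ε", "ε") returns the symmetric of a, and for every analogy x : y = z : t the chords from x to y and from z to t have equal length, which is the constraint C_{x→y} = C_{z→t} stated above.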



1 Another useful case is when the letters of the alphabet are defined as sets of binary
features [10]. This is for example the case of the phonemes of a language. The
resolution is done by solving analogies between sets of features and a natural distance
is the Hamming distance (the distance is given by the difference between phonemes
in terms of features). In that case, some analogical equations on the alphabet may
have no solution.

Fig. 3. Representing C_7, the additive cyclic finite group with 7 elements, and defining
a distance on C_7.



3.3 Using Transducers for Computing Analogical Solutions

We now have to find how to compose these elementary equations from the first
three sequences of the analogical equation on sequences X : Y = Z : T, T
being the solution to find. The first problem is to characterize the relation “is
to” in the analogical equation. Our idea is to use the notion of edit distance to
represent the transformation of a sequence into another. In this subsection, we
will explain how to formalise this transformation with a transducer. The second
problem is to characterize the relation “as” that represents a relation between
the two transformations. This relation will be formalized with an automaton.
The use of transducers is inspired by the work of Stroppa and Yvon [11].

About Finite State Transducers and Automata. A finite automaton A is
a 5-tuple (Σ, Q, q_0, F, δ), where Σ is a finite alphabet, Q a finite set of states,
q_0 ∈ Q the initial state, F ⊆ Q the set of final states, and δ the transition
function. The language accepted by A is denoted L(A); denoting by δ* the transitive
closure of δ, we have L(A) = {w | δ*(q_0, w) ∈ F}. A_L denotes the canonical
automaton for L whenever L is a rational language.

A finite-state transducer T is a finite automaton with two tapes (an input
and an output tape); transitions are labeled with pairs a : b with a in the input
alphabet Σ_1, and b in the output alphabet Σ_2. In the rest of this paper, we will
only consider the case where Σ_1 = Σ_2. A thorough introduction to finite-state
transducers and their use in the context of natural language processing is given
in [12].

An Edit Transducer for the Relation “Is to”. The idea is to define a
weighted transducer that can copy with a null cost, and that can substitute, insert
or delete with costs that are defined by the distance table proposed in section
3.2. This transducer is represented in Fig. 4. We assume that C_{a→b} is the cost of
the substitution of a into b. If a is ε then it concerns the insertion of b; if b is ε
then it concerns the deletion of a.

The edit transducer is T_edit = (Σ_1 ∪ Σ_2, Q, q_0, δ), the 1-state weighted
transducer defined as: Q = {0}, I = 0, F = {0}, and δ is defined as ∀a ∈ Σ_1
and ∀b ∈ Σ_2, δ(0, a : b / C_{a→b}) = 0; δ(0, a : ε / C_{a→ε}) = 0; δ(0, ε : b / C_{ε→b}) = 0;
∀a = b, δ(0, a : b / 0) = 0.

Let T_X (resp. T_Y) be the transducer that copies X (resp. Y). The alignment
between two sequences X and Y is then the result of the composition T_X ∘
T_edit ∘ T_Y. The result is very similar to the edit tables built by the Wagner
and Fischer algorithm (Fig. 5).

Fig. 4. The edit transducer.

Fig. 5. The automaton that recognizes all the alignments between bac and be.
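
The composition above can be pictured as a lattice whose states are pairs of positions (i, j) in X and Y. The following Python sketch does not build weighted transducers; it only enumerates the paths of that lattice, each path being one alignment. The function name and the use of the empty string for ε are assumptions of the sketch.

```python
def alignments(x, y):
    """Yield every alignment of x and y as a list of (a, b) pairs.

    The empty string "" stands for ε; the pair ("", "") is never produced.
    """
    def walk(i, j):
        if i == len(x) and j == len(y):
            yield []                       # final state (|x|, |y|) reached
            return
        if i < len(x) and j < len(y):      # substitution (or copy)
            for rest in walk(i + 1, j + 1):
                yield [(x[i], y[j])] + rest
        if i < len(x):                     # deletion of x[i]
            for rest in walk(i + 1, j):
                yield [(x[i], "")] + rest
        if j < len(y):                     # insertion of y[j]
            for rest in walk(i, j + 1):
                yield [("", y[j])] + rest
    return walk(0, 0)
```

Summing the costs C_{a→b} along a path gives the weight of the corresponding alignment; the minimum over all paths is the edit distance.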



An Automaton for the Relation “as”. The next step is to use the relation
between edit transformations to “align” the transformation of X into Y and of
X into Z. The principle of our approach is to arrange the alignments found during
the former step so as to produce all possible solutions by solving elementary
equations.

The composition produces a transducer that recognizes the language of the 
alignments between two sequences. This transducer has transitions that corre- 
spond either to substitution (including identity), to deletion or insertion. Each 
state of the transducer is a compound state built from the edit transducer and 
the copy transducers. It is therefore numbered by a couple of indices taken from 
both copy transducers (see Fig. 5). 

We propose to build an automaton which recognizes the language of the
solutions of the analogical equation. The idea is to synchronise the transitions
in this automaton on the letters of X, the first string of the analogical equation
X : Y = Z : T. That is why we consider that in every situation (state), we
are always at the same position in X^1 (aligned with Y^1) and in X^2 (aligned
with Z^1).

This automaton takes into account the compound transducers that produce the
alignments between X^1 and Y^1 on the one hand, and the alignments between X^2 and
Z^1 on the other hand. A state of the automaton is designated by a pair of index pairs,
one per transducer, that is ((i, j), (i', j')), and the first state is ((0, 0), (0, 0)).

Considering that we come from a state ((i, j), (i, j')) (as i = i'), we can go to
several possible states, incrementing i or j or j'. Each transition corresponds
to a case where, from X to Y and from X to Z, there is a deletion on both
sides, an insertion on one side and nothing on the other side, etc. Note that
going from state ((i, j), (i, j')) to state ((i, j + 1), (i, j' + 1)), which would represent
two insertions at the same time, is impossible. A transition of this automaton
therefore corresponds to the solution of an elementary analogical equation.

Finally the automaton that recognizes the language of analogical solutions is
A = (Σ, Q, q_0, δ), defined as:

- Q, the set of states, numbered by the indices of states of both transducers
that produce alignments between X and Y and between X and Z;

- I = ((0, 0), (0, 0)), the initial state;

- F = {((i, j), (i', j'))}, the final state where i = i' = j = j'. These indices are
equal to the length of the optimal alignments;

- δ is defined as ∀t ∈ Σ:

• δ(((i, j), (i, j')), t) = ((i + 1, j), (i + 1, j')), where X[i] : ε = ε : t;

• δ(((i, j), (i, j')), t) = ((i + 1, j), (i + 1, j' + 1)), where X[i] : ε = ε : t;

• δ(((i, j), (i, j')), t) = ((i, j + 1), (…)), where ε : Y[j] = ε : t;

• δ(((i, j), (i, j')), t) = ((i + 1, j + 1), (i + 1, j')), where X[i] : Y[j] = ε : t;

• δ(((i, j), (i, j')), t) = ((i + 1, j + 1), (i + 1, j' + 1)), where
X[i] : Y[j] = Z[j'] : t;

• δ(((i, j), (i, j')), t) = ((i, j), (i, j' + 1)), where ε : ε = Z[j'] : t;

• δ(((i, j), (i, j')), t) = ((i, j + 1), (i, j')), where ε : Y[j] = ε : t.

The solution to each elementary equation is found in the analogy table defined
in section 3.2. The combination of the edit transducers and the automaton
produces a graph with weighted edges. The paths from the initial state to the
final state that have minimum cost describe the solutions of the analogical
equation.
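
The construction can be sketched in Python as a shortest-path computation over states (i, j, j'), which stand for ((i, j), (i, j')). Several choices below are assumptions of this sketch and are not taken from the text: letters are encoded in the additive cyclic group with 7 elements (ε as 0, then a to f), each transition solves x : y = z : t with x, y, z the letters read in X, Y, Z (or ε when that sequence does not advance), the weight of a transition is the sum of the two edit costs C_{x→y} + C_{x→z} with the circle distance, and only one minimum-cost solution is returned rather than the whole language of solutions.

```python
import math
from itertools import product

LETTERS = "εabcdef"                     # assumed encoding: index 0 is ε
N = len(LETTERS)
IDX = {s: k for k, s in enumerate(LETTERS)}

def solve_letter(a, b, c):
    """Elementary equation a : b = c : t in the cyclic group."""
    return LETTERS[(IDX[b] + IDX[c] - IDX[a]) % N]

def dist(p, q):
    """Chord distance between letters placed regularly on the unit circle."""
    return abs(2 * math.sin(math.pi * (IDX[p] - IDX[q]) / N))

# Which of X, Y, Z advance on a transition; (0, 1, 1), two insertions
# at the same time, is excluded as in the text.
MOVES = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1), (0, 0, 1), (0, 1, 0)]

def solve_analogy(X, Y, Z):
    """Return one minimum-cost solution T of X : Y = Z : T (strings over a..f)."""
    best = {(0, 0, 0): (0.0, "")}       # state -> (cost so far, output so far)
    # Lexicographic order is a topological order: every move increases an index.
    for i, j, k in product(range(len(X) + 1), range(len(Y) + 1), range(len(Z) + 1)):
        cost, out = best[(i, j, k)]
        for dx, dy, dz in MOVES:
            if i + dx > len(X) or j + dy > len(Y) or k + dz > len(Z):
                continue
            x = X[i] if dx else "ε"
            y = Y[j] if dy else "ε"
            z = Z[k] if dz else "ε"
            t = solve_letter(x, y, z)
            new_cost = cost + dist(x, y) + dist(x, z)
            nxt = (i + dx, j + dy, k + dz)
            if nxt not in best or new_cost < best[nxt][0]:
                best[nxt] = (new_cost, out + ("" if t == "ε" else t))
    return best[(len(X), len(Y), len(Z))][1]
```

When Y equals X the cheapest paths copy X into Y at no cost and the output is Z itself, which matches the determinism axiom lifted to sequences. Keeping every path of minimum cost instead of the first one found would give the set of solutions.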

We have proposed a direct algorithm for finding a set of solutions to any
analogical equation on sequences, based on the same concepts that we have
presented in the preceding section. This algorithm has been presented in full detail
in [4]. It can be considered as a heuristic algorithm since it only considers the
optimal alignments to solve the analogy. By contrast, our formal approach
takes every possible alignment into account and solves the analogy on letters; a
preference measure can then be applied to choose the best solutions.

4 Related Work 

Solving analogical equations between strings has drawn little attention in
the past. Most relevant to our discussion are the works of Lepage, especially [13]
and [14], presented in full detail in [1], and the more recent work of Yvon [15] and
Stroppa [11].

Like Lepage, Yvon considers in [15] that comparing sequences
for solving analogical equations must be based only on insertions and deletions
of letters², and must satisfy the axioms of Lepage. Yvon introduces the notion of
shuffle [16, 17], and of complementary set, to construct a finite-state transducer to
compute the set of solutions to any analogical equation on strings. He also produces
a refinement of this finite-state machine able to rank these analogies so as
to recover an intuitive preference towards “simple” analogies, that preserve large
chunks of the original objects.

Lepage [13] details the implementation of an analogy solver for words, based
on a generalisation of the algorithm computing the longest common subsequence
between strings. Given the equation u : v = w : x, this algorithm proceeds
as follows: it alternately scans v and w for symbols in u; when one is found,
say in w, it outputs in x the prefix of w which did not match, while extending
as much as possible the current match; then it exchanges the roles of v and w. If u
has been entirely matched, then it outputs if necessary in x the remaining suffixes
of v and w; otherwise the analogical equation does not have any solution.

In the framework of Yvon, our approach is equivalent to computing the 
longest common subsequence (and not all the common subsequences). Thus we 
can consider that we propose an extension that uses one supplementary editing 
operator: the substitution of a letter by a different letter. Moreover, Yvon says 
himself in [15] that his work is very similar to Lepage’s work. 

5 Conclusion and Extension 

In this paper, we propose a new approach for solving analogy between sequences 
as a first step to learning by analogy. This approach is built on a formal frame- 
work based on transducers using edit distance in the “is to” relation and an 
automaton for solving elementary equations on letters. 

We have given an algebraic structure so as to get a unique solution to all possible
combinations of analogical equations with symbols chosen in an alphabet.
We have shown that a finite additive cyclic group is a sufficient structure and
we have given an example of the corresponding operator ⊕ on a given set Σ ∪ {ε}. We
have also shown how to build a distance on this group, needed for alignments at
the first step of the algorithm.

An extension to this work could focus on several points. Firstly we could
examine whether there are less restrictive algebraic structures than the cyclic finite
group that respect the desired properties of an analogy. We could also study the
influence of the analogical operator on the distance table. For example, is there an
algorithm to give a general solution for building a distance for every alphabet
given an analogical table? Is there an algorithm to transform an a priori distance
into an analogical distance or to build an analogical distance that is close to the
given distance? Finally, we have to focus on learning by analogy.

2 This assumption is called the inclusion property.

Abstract. Powerful methods and algorithms are known to learn regular
languages. Aiming at extending them to more complex grammars,
we choose to change the way we represent these languages. Among the
formalisms that allow one to define classes of languages, that of
string-rewriting systems (SRS) has outstanding properties. Indeed, SRS are
expressive enough to define, in a uniform way, a noteworthy and non-trivial
class of languages that contains all the regular languages, {a^n b^n : n ≥ 0},
{w ∈ {a, b}* : |w|_a = |w|_b}, the parenthesis languages of Dyck, the
language of Łukasiewicz, and many others. Moreover, SRS constitute an
efficient (often linear) parsing device for strings, and are thus promising
and challenging candidates in forthcoming applications of Grammatical
Inference. In this paper, we pioneer the problem of their learnability. We
propose a novel and sound algorithm which identifies them in
polynomial time. We illustrate the execution of our algorithm through
a large number of examples and finally raise some open questions
and research directions.

Keywords: Learning Context-Free Languages, Rewriting Systems. 



1 Introduction 

Whereas for the case of learning regular languages there are now a number of
positive results and algorithms, things tend to get harder when the entire class of
context-free languages is considered [10, 17]. Typical approaches have consisted
in learning special sorts of grammars [20], in using genetic algorithms or
artificial intelligence ideas [16], and in using compression techniques [13]. Yet more and
more attention has been drawn to the problem: one example is the Omphalos
context-free language learning competition [19].

An attractive alternative when blocked by negative results is to change the
representation mode. In this line, little work has been done for the context-free
case: one exception is pure context-free grammars, which are grammars where
both the non-terminals and the terminals come from the same alphabet [8].

* This work was supported in part by the IST Programme of the European
Community, under the PASCAL Network of Excellence, IST-2002-506778. This publication
only reflects the authors’ views.

In this paper, we investigate string-rewriting systems (SRS). Invented in
1914 by Axel Thue, the theory of SRS (also called semi-Thue systems) and its
extension to trees and to graphs received a lot of attention throughout the 20th
century (see [1, 3]). Rewriting a string consists in replacing substrings by others,
as far as possible, following laws called rewrite rules. For instance, consider
strings made of a and b, and the single rewrite rule ab → λ. Using this rule
consists in replacing a substring ab by the empty string, thus in erasing ab. It
allows us to rewrite abaabbab as follows:

abaabbab → abaabb → abab → ab → λ

Other rewriting derivations may be considered but they all lead to λ. Actually,
it is rather clear from this example that a string will rewrite to λ iff it is a
“parenthetic” string, i.e., a string of the Dyck language. More precisely, the
Dyck language is completely characterized by this single rewrite rule and the
string λ, which is reached by rewriting all other strings of the language. This
property was first noticed in a seminal paper by Nivat [14] which was the starting
point of a large amount of work during the last three decades.
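
The effect of the single rule ab → λ is easy to reproduce. The short Python sketch below (the function name and the leftmost-first strategy are choices made for this illustration) records the derivation obtained by always erasing the leftmost factor ab. It follows a different derivation from the one displayed above but ends on the same string.

```python
def erase_ab(s):
    """Return the derivation obtained by repeatedly erasing the leftmost 'ab'."""
    steps = [s]
    while "ab" in s:
        s = s.replace("ab", "", 1)   # one rewriting step with the rule ab → λ
        steps.append(s)
    return steps
```

A string over {a, b} belongs to the Dyck language exactly when the last element of the returned list is the empty string.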

We use this property, and others, to introduce a class of rewriting systems
which is powerful enough to represent in an economical way all regular languages
and some typical context-free languages: {a^n b^n : n ≥ 0}, {w ∈ {a, b}* : |w|_a =
|w|_b}, the parenthesis languages of Dyck, the language of Łukasiewicz, and many
others. We also provide a learning algorithm called LARS (Learning Algorithm
for Rewriting Systems) which can learn systems representing these languages
from string examples and counter-examples of the language.

In section 2 we give the general notations relative to the languages we consider 
and discuss the notion of learning. We introduce our rewriting systems and their 
expressiveness in section 3 and develop the properties they must fulfill to be 
learnable in section 4. The general learning algorithm is presented and justified 
in section 5. We report in section 6 some experimental results and conclude. 

2 Learning Languages 

An alphabet Σ is a finite nonempty set of symbols called letters. A string w over
Σ is a finite sequence w = a_1 a_2 ... a_n of letters. Let |w| denote the length of
w. In the following, letters will be indicated by a, b, c, ..., strings by u, v, ..., z,
and the empty string by λ. Let Σ* be the set of all strings. We assume a fixed
but arbitrary total order < on the letters of Σ. As usual, we extend < to Σ* by
defining the hierarchical order [15], denoted ⊲, as follows:

w_1 ⊲ w_2 iff |w_1| < |w_2|, or
              |w_1| = |w_2| and ∃u, v_1, v_2 ∈ Σ*, ∃x_1, x_2 ∈ Σ
              such that w_1 = u x_1 v_1, w_2 = u x_2 v_2 and x_1 < x_2.

By a language we mean any subset L ⊆ Σ*. Many classes of languages
were investigated in the literature. In general, the definition of a class 𝕃 relies
on a class ℝ of abstract machines, here called representations, that characterize
all and only the languages of 𝕃: (1) ∀R ∈ ℝ, ℒ(R) ∈ 𝕃 and (2) ∀L ∈
𝕃, ∃R ∈ ℝ such that ℒ(R) = L. Two representations R_1 and R_2 are equivalent
iff ℒ(R_1) = ℒ(R_2). In this paper, we will investigate the class REG of regular
languages characterized by the class DFA of deterministic finite automata (dfa),
and the class CFL of context-free languages represented by the class CFG of
context-free grammars (cfg).

We now turn to our learning problem. The size of a representation R, denoted
by ‖R‖, is polynomially related to the size of its encoding.

Definition 1. Let 𝕃 be a class of languages represented by some class ℝ.

1. A sample S for a language L ∈ 𝕃 is a finite set of ordered pairs (w, label(w)) ∈
Σ* × {+, −} such that if label(w) = + then w ∈ L and if label(w) = − then
w ∉ L. The size of S is the sum of the lengths of all strings in S.

2. An (𝕃, ℝ)-learning algorithm is a program that takes as input a sample of
labeled strings and outputs a representation from ℝ.

Finally, let us recall what “learning” means. We choose to base ourselves
on the paradigm of polynomial identification, as defined in [6, 2], since many
authors showed that it was both relevant and tractable. Other paradigms are
known (e.g. PAC-learnability), but they are often either similar to this one or
inconvenient for Grammatical Inference problems.

In this paradigm we first demand that the learning algorithm has a running
time polynomial in the size of the data from which it learns.
Next we want the algorithm to converge in some way to a chosen target. Ideally 
the convergence point should be met very quickly, after having seen a polyno- 
mial number of examples. As this constraint is usually too hard, we want the 
convergence to take place in the limit, i.e., after having seen a finite number 
of examples. The polynomial aspects then correspond to the size of a minimal 
learning or characteristic sample, whose presence should ensure identification. 
For more details on these models we refer the reader to [6, 2]. 

3 Defining Languages with String-Rewriting Systems 

String-rewriting systems are usually defined as sets of rewrite rules. These rules
allow factors to be replaced by others in strings. However, as we feel that this
mechanism is not flexible enough, we would like to extend it. Indeed, a rule that one
would like to use at the beginning (prefix) or at the end of a string could also
be used in the middle of this string and then have undesirable side effects.

Therefore, we introduce two new symbols $ and £ that do not belong to
the alphabet Σ and will respectively mark the beginning and the end of each
string. In other words, we are going to consider strings from the set $Σ*£. As
for the rewrite rules, they will be partially marked (and thus belong to Σ̄* =
(λ + $)Σ*(λ + £)). Their forms will constrain their uses either to the beginning,
or to the end, or to the middle, or to the string taken as a whole. Notice that this
solution is an intermediate approach between the usual one and string-rewriting
systems with variables introduced in [11].

Definition 2 (Delimited SRS).

- A delimited rewrite rule is an ordered pair of strings (l, r), generally written
l → r, such that l and r satisfy one of the four following constraints:

1. l, r ∈ $Σ* (used to rewrite prefixes) or
2. l, r ∈ $Σ*£ (used to rewrite whole strings) or
3. l, r ∈ Σ* (used to rewrite factors) or
4. l, r ∈ Σ*£ (used to rewrite suffixes).

Rules of type 1 and 2 will be called $-rules and rules of type 3 and 4 will be
called non-$-rules.

- By a delimited string-rewriting system (DSRS), we mean any finite set ℛ of
delimited rewrite rules.

Let |ℛ| be the number of rules of ℛ and ‖ℛ‖ the sum of the lengths of the
strings ℛ is made of: ‖ℛ‖ = ∑_{(l → r) ∈ ℛ} |lr|.

Given a DSRS ℛ and two strings w_1, w_2 ∈ Σ̄*, we say that w_1 rewrites in
one step into w_2, written w_1 →_ℛ w_2 or simply w_1 → w_2, iff there exists a rule
(l → r) ∈ ℛ and two strings u, v ∈ Σ̄* such that w_1 = ulv and w_2 = urv.
A string w is reducible iff there exists w' such that w → w', and irreducible
otherwise. E.g., the string $aabb£ rewrites to $aaa£ with rule bb£ → a£ and
$aaa£ is irreducible. We get immediately the following property:

Proposition 1. The set $Σ*£ is stable w.r.t. →_ℛ.

In other words, $ and £ cannot disappear or move in a string by rewriting.

Let →*_ℛ (or simply →*) denote the reflexive and transitive closure of →_ℛ.
We say that w_1 reduces to w_2 or that w_2 is derivable from w_1 iff w_1 →*_ℛ w_2.

Definition 3 (Language Induced by a DSRS). Given a DSRS ℛ and an
irreducible string e ∈ Σ*, we define the language ℒ(ℛ, e) as the set of strings
that reduce to e using the rules of ℛ:

ℒ(ℛ, e) = {w ∈ Σ* : $w£ →*_ℛ $e£}.

Deciding whether a string w belongs to a language ℒ(ℛ, e) or not consists in
trying to obtain e from w by a rewriting derivation. However, w may be the
starting point of numerous derivations and so, such a task may be really hard.
(Nevertheless, remember that we introduced $ and £ to allow some control...)
We will tackle these problems in the next section but present some examples first.

Example 1. Let Σ = {a, b}.

- ℒ({ab → λ}, λ) is the Dyck language. Indeed, this single rule erases factors
ab, so we get the following example of derivation:

$aabbab£ → $aabb£ → $ab£ → $£

- ℒ({ab → λ; ba → λ}, λ) is the language {w ∈ Σ* : |w|_a = |w|_b} since every
rewriting step erases one a and one b.

- ℒ({aabb → ab; $ab£ → $£}, λ) = {a^n b^n : n ∈ ℕ}. For instance,

$aaaabbbb£ → $aaabbb£ → $aabb£ → $ab£ → $£

Notice that the rule $ab£ → $£ is necessary for λ to belong to the language.

- ℒ({$ab → $}, λ) is the regular language (ab)*. Indeed,

$ababab£ → $abab£ → $ab£ → $£

Actually, all regular languages can be induced by a DSRS:

Theorem 1. For each regular language L, there exist a DSRS ℛ and a string
e such that L = ℒ(ℛ, e).

Proof (Hint). A DSRS that is only made of $-rules defines a prefix grammar [5].
It has been shown that this kind of grammar generates exactly the regular
languages.
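
Definition 3 and Example 1 can be checked mechanically on small strings. The Python sketch below is a naive membership test, not the parsing device studied in the next sections: it explores every derivation starting from $w£ by breadth-first search. The function names and the encoding of a rule l → r as a pair of Python strings (with "" for λ) are assumptions of the sketch, and the search is only guaranteed to stop when the system admits no infinite derivation.

```python
from collections import deque

def one_step(w, rules):
    """Yield every string obtained from w by one rewriting step."""
    for l, r in rules:
        start = w.find(l)
        while start != -1:                       # every occurrence of l in w
            yield w[:start] + r + w[start + len(l):]
            start = w.find(l, start + 1)

def in_language(w, rules, e=""):
    """Return True iff $w£ reduces to $e£, exploring all derivations."""
    source, target = "$" + w + "£", "$" + e + "£"
    seen, queue = {source}, deque([source])
    while queue:
        current = queue.popleft()
        if current == target:
            return True
        for nxt in one_step(current, rules):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# The four systems of Example 1.
DYCK = [("ab", "")]
SAME_COUNT = [("ab", ""), ("ba", "")]
ANBN = [("aabb", "ab"), ("$ab£", "$£")]
AB_STAR = [("$ab", "$")]
```

For instance, in_language("aaabbb", ANBN) holds while in_language("abab", ANBN) does not, because the delimited rule $ab£ → $£ can only fire on the whole string.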

4 Shaping Learnable DSRS 

As already mentioned, a string w belongs to a language ℒ(ℛ, e) iff one can build
a derivation from w to e. However this raises many difficulties. Firstly, one can
imagine a DSRS such that a string can be rewritten indefinitely¹. In other words,
an algorithm that would try to answer the problem may loop. Secondly, even if
all the derivations induced by a DSRS are finite, they could be of exponential
length and thus computationally intractable².

We first extend the hierarchical order ⊲ to the strings of Σ̄*, by defining
the extended hierarchical order, denoted ≺, as follows: ∀w_1, w_2 ∈ Σ*, if w_1 ⊲
w_2 then w_1 ≺ $w_1 ≺ w_1£ ≺ $w_1£ ≺ w_2. Therefore, if a < b, then λ ⊲ a ⊲ b ⊲
aa ⊲ ab ⊲ ba ⊲ bb ⊲ aaa ⊲ ..., so λ ≺ $ ≺ £ ≺ $£ ≺ a ≺ $a ≺ a£ ≺ $a£ ≺ b ≺ ...
The following technical definition ensures that all the rewriting derivations are
finite and tractable in polynomial time.

Definition 4 (Hybrid DSRS). We say that a rule l → r is
(i) length-reducing iff |l| > |r| and (ii) length-lexicographic iff l ≻ r.

A DSRS ℛ is hybrid iff (i) all $-rules (whose left-hand sides are in $Σ*(λ + £))
are length-lexicographic and (ii) all non-$-rules (whose left-hand sides are in
Σ*(λ + £)) are length-reducing.
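
The extended hierarchical order and the hybrid condition translate directly into a sort key. In the Python sketch below, the function names are hypothetical and the order < on letters is assumed to be the usual order on characters; a rule l → r is again a pair of strings.

```python
def order_key(s):
    """Sort key realising the extended hierarchical order on delimited strings.

    The unmarked core is compared first (by length, then lexicographically);
    ties are broken by the markers: none, then $ only, then £ only, then both.
    """
    has_start, has_end = s.startswith("$"), s.endswith("£")
    core = s[(1 if has_start else 0): len(s) - (1 if has_end else 0)]
    rank = (1 if has_start else 0) + (2 if has_end else 0)
    return (len(core), core, rank)

def is_hybrid(rules):
    """Check Definition 4: $-rules length-lexicographic, other rules length-reducing."""
    for l, r in rules:
        if l.startswith("$"):
            if not order_key(l) > order_key(r):
                return False
        elif not len(l) > len(r):
            return False
    return True
```

Sorting with this key reproduces the chain λ ≺ $ ≺ £ ≺ $£ ≺ a ≺ $a ≺ a£ ≺ $a£ ≺ b given above. A length-preserving rule such as $ba → $ab is accepted as a $-rule, whereas ba → ab is rejected as a non-$-rule.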

Theorem 2. All the derivations induced by a hybrid DSRS ℛ are finite. Moreover,
every derivation starting from a string w has a length that is ≤ |w| · |ℛ|.

1 Consider the derivations induced by {a → b; b → a; c → cc}...

2 Consider the DSRS {1£ → 0£; 0£ → c1d£; 0c → c1; 1c → 0d; d0 → 0d; d1 →
1d; dd → λ}. All the derivations it induces are finite; indeed, assuming that d > 1 >
0 > c, the left-hand side l is lexicographically greater than the right-hand side r for
all rules l → r, so this DSRS is strongly normalizing [3]. However, it induces the
derivation $1111£ → $1110£ →* $1101£ → $1100£ →* $1011£ →* $0000£

Proof. Let w_1 → w_2 be a single rewriting step. There exists a rule l → r and
two strings u, v ∈ Σ̄* such that w_1 = ulv and w_2 = urv. Notice that if |l| > |r|
then l ≻ r. Moreover, if l ≻ r, then we deduce that w_1 ≻ w_2. So if one has a
derivation w → u_1 → u_2 → ..., then w ≻ u_1 ≻ u_2 ≻ .... As ≺ is a well-order,
there is no infinite and strictly decreasing chain of the form w ≻ u_1 ≻ u_2 ≻ ....
So every derivation induced by ℛ is finite. Now let n ∈ ℕ. Assume that for all
strings w' such that |w'| < n, the lengths of the derivations starting from w' are
at most |w'| · |ℛ|. Let w be a string of length n. We claim that the maximum
length of a derivation that would preserve the length of w cannot exceed |ℛ|
rewriting steps. Indeed, all rules that can be used along such a derivation are of
the form $l → $r, with |l| = |r| and l ≻ r; when such a rule is used once, then
it cannot be used a second time in the same derivation. Otherwise, there would
exist a derivation $lu£ → $ru£ →* $lv£ with |u| = |v| (since the length
is preserved). As $ru£ →* $lv£ and |l| = |r| and |u| = |v|, we deduce that r ≽ l,
which is impossible since r ≺ l. So there are at most |ℛ| rewriting steps that
preserve the length of w, and then the application of a rule produces a string
w' whose length is < n. So by induction hypothesis, the length of a derivation
starting from w is no more than |ℛ| + |w'| · |ℛ| ≤ |w| · |ℛ|. □

We saw that a hybrid DSRS induces finite and tractable derivations. Nevertheless,
many different irreducible strings may be reached from one given string
by rewriting. Therefore, answering the problem “w ∈ ℒ(ℛ, e)?” will require
computing all the derivations that start with w and checking whether one of them
ends with e. In other words, such a DSRS is a kind of “nondeterministic” (thus
inefficient) parsing device. A usual way to circumvent this difficulty is to require
our hybrid DSRS to be also Church-Rosser [3].

Definition 5 (Church-Rosser DSRS). We say that a DSRS ℛ is Church-Rosser
iff for all strings w, u1, u2 ∈ Σ̄* such that w →* u1 and w →* u2, there
exists w' ∈ Σ̄* such that u1 →* w' and u2 →* w'.

In the definition above, if w →* u1 and w →* u2 and u1 and u2 are irreducible
strings, then u1 = u2 (= w'). So given a string w, there is no more than one
irreducible string that can be reached by a derivation starting with w, whichever
derivation is considered. However, the Church-Rosser property is undecidable
in general [3], so we constrain our DSRS to fulfill a restrictive condition:

Definition 6 (ANo DSRS). A DSRS ℛ is almost nonoverlapping (ANo) iff
for all rules R1 = l1 → r1 and R2 = l2 → r2 of ℛ:

i. if l1 = l2 then r1 = r2;

ii. if l1 is strictly included in l2: ∃u, v ∈ Σ̄*, ul1v = l2, uv ≠ λ, then ur1v = r2;

iii. if a strict suffix of l1 is a strict prefix of l2:
∃u, v ∈ Σ̄*, l1u = vl2, 0 < |v| < |l1|, then r1u = vr2.

Notice that if R1 does not overlap R2, then R2 may still overlap R1.
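
The three conditions of Definition 6 can be checked by brute force over all ordered pairs of rules. The following Python sketch shows one way of doing so; it assumes that rules are pairs (l, r) of strings with non-empty left-hand sides, and the function name is_ano is borrowed from the pseudocode of Section 5 while its body is our own illustration.

```python
def is_ano(rules):
    """Brute-force check of the almost nonoverlapping (ANo) condition.

    `rules` is a list of (l, r) string pairs (assumed representation).
    Every ordered pair (R1, R2) is examined, including R1 = R2.
    """
    for l1, r1 in rules:
        for l2, r2 in rules:
            # (i) identical left-hand sides need identical right-hand sides
            if l1 == l2 and r1 != r2:
                return False
            # (ii) l1 strictly included in l2: u l1 v = l2 with uv non-empty
            if len(l1) < len(l2):
                start = l2.find(l1)
                while start != -1:
                    u, v = l2[:start], l2[start + len(l1):]
                    if u + r1 + v != r2:
                        return False
                    start = l2.find(l1, start + 1)
            # (iii) a strict suffix s of l1 is a strict prefix of l2:
            #       l1 = v s and l2 = s u, hence l1 u = v l2
            for k in range(1, len(l1)):
                s = l1[k:]
                if len(s) < len(l2) and l2.startswith(s):
                    v, u = l1[:k], l2[len(s):]
                    if r1 + u != v + r2:
                        return False
    return True


if __name__ == "__main__":
    print(is_ano([("ab", ""), ("ba", "")]))
    print(is_ano([("ab", "b"), ("ba", "a")]))
```

For instance, the rules ab → λ and ba → λ overlap on aba, but both ways of rewriting aba give a, so the pair is accepted; replacing them by ab → b and ba → a yields ba and aa respectively, and the pair is rejected.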

Theorem 3. Every ANo DSRS is Church-Rosser. Moreover, every subsystem
of an ANo DSRS is an ANo DSRS, and thus Church-Rosser.
Proof. Let us show that an ANo DSRS ℛ induces a rewriting relation → that
is subcommutative [7]. Let us write w1 →^ε w2 iff w1 → w2 or w1 = w2. We
claim that for all w, u1, u2, if w → u1 and w → u2, then there exists a string
w' such that u1 →^ε w' and u2 →^ε w'. Indeed, assume that w → u1 uses a rule
R1 = l1 → r1 and w → u2 uses a rule R2 = l2 → r2. If both rewriting steps
are independent, i.e., w = x l1 y l2 z for some strings x, y, z, then u1 = x r1 y l2 z
and u2 = x l1 y r2 z. Obviously, u1 → w' and u2 → w' with w' = x r1 y r2 z.
Otherwise, R1 overlaps R2 (or vice versa), and so u1 = u2, since ℛ is ANo. An
easy induction allows one to generalize this property to derivations: if w →* u1 and
w →* u2 then there exists w' such that u1 (→^ε)* w' and u2 (→^ε)* w', where (→^ε)* is
the reflexive and transitive closure of →^ε. Finally, as u1 (→^ε)* w' and u2 (→^ε)* w',
we deduce that u1 →* w' and u2 →* w'. □

Finally, we get the following properties with our DSRS: (1) For all strings w,
there is no more than one irreducible string that can be reached by a derivation
which starts with w, whichever derivation is considered. This irreducible
string will be called the normal form of w and denoted w↓. (2) No derivation
can be prolonged indefinitely, so every string w has at least one normal form.
And whatever the way a string w is reduced, the rewriting steps produce strings
that are ineluctably closer and closer to w↓. An important consequence is that
one has an immediate algorithm to check whether w ∈ ℒ(ℛ, e) or not: one
only needs to (i) compute the normal form w↓ of w and (ii) check if w↓ and
e are syntactically equal. As all the derivations have polynomial lengths, this
algorithm is polynomial in time.
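
The membership test just described can be sketched in Python as follows. This is a minimal illustration under stated assumptions: rules are pairs (l, r) of strings, the system is assumed to be hybrid and ANo (so that the loop terminates and the result does not depend on the rewriting strategy), and the input string is delimited with "$" and "£" before being normalized. The names normal_form and member, and the one-rule example system, are hypothetical.

```python
def normal_form(w, rules):
    """Rewrite w until no rule applies and return the irreducible result.

    Assumes `rules` is a hybrid ANo DSRS given as (l, r) pairs: hybridity
    guarantees termination, ANo guarantees a unique normal form.
    """
    changed = True
    while changed:
        changed = False
        for l, r in rules:
            pos = w.find(l)
            if pos != -1:
                # apply the rule at its leftmost occurrence
                w = w[:pos] + r + w[pos + len(l):]
                changed = True
                break
    return w


def member(w, rules, e):
    """Decide membership of w by comparing its normal form with e."""
    # Assumption: strings are delimited with $ and £ before rewriting.
    return normal_form("$" + w + "£", rules) == e


if __name__ == "__main__":
    # Illustrative system: the single rule ab -> (empty string), with e = $£.
    rules = [("ab", "")]
    for w in ["aabb", "abab", "aab"]:
        print(w, member(w, rules, "$£"))
```

Any strategy for choosing the next rule and position would do here; the leftmost occurrence of the first applicable rule is chosen only for simplicity.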

5 Learning Languages Induced by DSRS 

In this section we present our learning algorithm and its properties. The idea is
to enumerate the rules following the order ≺. We discard those that are useless
or inconsistent w.r.t. the data, and those that break the ANo condition.

The first thing LARS does is to compute all the factors of S+ and to sort
them w.r.t. ≺. Left- and right-hand sides of the rules will be chosen in this set since
it is reasonable to think that the positive examples contain all the information that
is needed to learn the target language. This assumption dramatically reduces
the search space. LARS then enumerates the elements of this set with two
“for” loops, which build the candidate rules.

Function is_useful discards the rules that cannot be used to rewrite at least
one string of the current set I+ (and are thus useless). Function type returns
an integer in {1, 2, 3, 4} and allows one to check whether the candidate rule is
syntactically correct according to Def. 2. Function is_ANo rejects the rules that
would produce a non-ANo DSRS. Notice that a candidate rule that passes all these
tests ensures that the DSRS will be syntactically correct, hybrid and ANo.
The last thing to check is that the rule is consistent with the data, i.e., that it
does not produce a string belonging to both I+ and I−. This is easily performed
by computing the normal forms of the strings of I+ and I−, which is the aim of
function normalize.
Algorithm 1: LARS (Learning Algorithm for Rewriting Systems)

Data: a sample (S+, S−)
Result: (ℛ, e) where ℛ is a hybrid ANo DSRS and e is an irreducible string

begin
    ℛ ← ∅; I+ ← S+; I− ← S−;
    F ← sort≺ {v : ∃u, w ∈ Σ̄*, uvw ∈ I+};
    for i = 1 to |F| do
        if is_useful(F[i], I+) then
            for j = 0 to i − 1 do
                if type(F[i]) = type(F[j]) then
                    S ← ℛ ∪ {F[i] → F[j]};
                    if is_ANo(S) then
                        E+ ← normalize(I+, S); E− ← normalize(I−, S);
                        if E+ ∩ E− = ∅ then
                            ℛ ← S; I+ ← E+; I− ← E−;
    e ← min≺ I+;
    foreach w ∈ I+ do
        if w ≠ e then ℛ ← ℛ ∪ {w → e};
    return (ℛ, e);
end
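
For readers who prefer executable code, Algorithm 1 can be transcribed into Python roughly as follows. This is a sketch under our own assumptions, not a reference implementation: rules are pairs (l, r) of strings, the examples are delimited with "$" and "£" on input, the letters of Σ are ordered by Python's default character order, and hybridity is tested explicitly by a hypothetical helper hybrid_ok because the pseudocode does not show where it is enforced. All helper names other than those appearing in the pseudocode are ours.

```python
def ext_key(s):
    """Sort key for the extended hierarchical order; the last field is the type."""
    left, right = s.startswith("$"), s.endswith("£")
    core = s[int(left):len(s) - int(right)]
    return (len(core), core, left + 2 * right)

def hybrid_ok(l, r):
    # $-rules are length-lexicographic as soon as r precedes l in the order;
    # the other rules must be length-reducing (explicit check, our assumption).
    return l.startswith("$") or len(l) > len(r)

def is_ano(rules):
    """Brute-force check of the three ANo conditions on all ordered pairs."""
    for l1, r1 in rules:
        for l2, r2 in rules:
            if l1 == l2 and r1 != r2:
                return False
            if len(l1) < len(l2):
                start = l2.find(l1)
                while start != -1:
                    if l2[:start] + r1 + l2[start + len(l1):] != r2:
                        return False
                    start = l2.find(l1, start + 1)
            for k in range(1, len(l1)):
                s = l1[k:]
                if len(s) < len(l2) and l2.startswith(s):
                    if r1 + l2[len(s):] != l1[:k] + r2:
                        return False
    return True

def normal_form(w, rules):
    """Rewrite w (leftmost occurrence of the first applicable rule) to the end."""
    changed = True
    while changed:
        changed = False
        for l, r in rules:
            pos = w.find(l)
            if pos != -1:
                w = w[:pos] + r + w[pos + len(l):]
                changed = True
                break
    return w

def lars(positive, negative):
    """Sketch of LARS; `positive` must be non-empty and disjoint from `negative`."""
    rules = []
    i_pos = {"$" + w + "£" for w in positive}   # delimiting is our assumption
    i_neg = {"$" + w + "£" for w in negative}
    factors = sorted({w[a:b] for w in i_pos for a in range(len(w) + 1)
                      for b in range(a, len(w) + 1)}, key=ext_key)
    for i, l in enumerate(factors):
        if not any(l in w for w in i_pos):       # is_useful
            continue
        for r in factors[:i]:
            if ext_key(l)[2] != ext_key(r)[2] or not hybrid_ok(l, r):
                continue                          # different types or not hybrid
            cand = rules + [(l, r)]
            if not is_ano(cand):
                continue
            e_pos = {normal_form(w, cand) for w in i_pos}
            e_neg = {normal_form(w, cand) for w in i_neg}
            if e_pos.isdisjoint(e_neg):           # consistent with the data
                rules, i_pos, i_neg = cand, e_pos, e_neg
    e = min(i_pos, key=ext_key)
    rules = rules + [(w, e) for w in i_pos if w != e]
    return rules, e

if __name__ == "__main__":
    rules, e = lars(["ab", "aabb"], ["a", "b", "ba"])
    print(rules, e)
```

On such a tiny sample the returned system only has to be consistent with the data, so the rules it contains need not resemble those of the intended target language.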



Theorem 4. Given a sample (S+, S−) of size m, algorithm LARS returns a
hybrid ANo DSRS ℛ and an irreducible string e such that S+ ⊆ ℒ(ℛ, e) and
S− ∩ ℒ(ℛ, e) = ∅. Moreover, its execution time is polynomial in m.

Proof (Hint). The termination and polynomiality of LARS are straightforward.
Moreover, the following four invariant properties are maintained all along the
double “for” loops: (1) ℛ is a hybrid ANo DSRS, (2) I+ contains all and only
the normal forms of the strings of S+ w.r.t. ℛ, (3) I− contains all and only the
normal forms of the strings of S− w.r.t. ℛ and (4) I+ ∩ I− = ∅. Clearly, these
properties remain true before the “foreach” loop. Now at the end of the last
“foreach” loop, it is clear that: (1) ℛ is a hybrid ANo DSRS, (2) e is the normal
form of all the strings of S+, so S+ ⊆ ℒ(ℛ, e) and (3) the normal forms of the
strings of S− are all in I− and e ∉ I−, so S− ∩ ℒ(ℛ, e) = ∅. □

We now establish an identification theorem for LARS. This theorem focuses 
on languages that may be defined thanks to special DSRS that we define now. 
We begin with the notion of consistent rule that characterizes the rules that 
LARS will have to find. 

Definition 7 (Consistent Rule). We say that a rule R = l → r is consistent
w.r.t. a language L ⊆ Σ* iff ∀u, v ∈ Σ̄*, if ulv ∈ $L£, then urv ∈ $L£.







